Abstract

We consider the fundamental problem of prediction with expert advice where the experts are
“optimizable”: there is a black-box optimization oracle that can be used to compute, in constant
time, the leading expert in retrospect at any point in time. In this setting, we give a novel online
algorithm that attains vanishing regret with respect to $N$ experts in total $\tilde{O}(\sqrt{N})$ computation
time. We also give a lower bound showing that this running time cannot be improved (up to
log factors) in the oracle model, thereby exhibiting a quadratic speedup as compared to the
standard, oracle-free setting where the required time for vanishing regret is $\tilde{\Theta}(N)$. These results
demonstrate an exponential gap between the power of optimization in online learning and its
power in statistical learning: in the latter, an optimization oracle (i.e., an efficient empirical
risk minimizer) allows one to learn a finite hypothesis class of size $N$ in time $O(\log N)$.

We also study the implications of our results to learning in repeated zero-sum games, in a setting
where the players have access to oracles that compute, in constant time, their best-response
to any mixed strategy of their opponent. We show that the runtime required for approximating
the minimax value of the game in this setting is $\tilde{\Theta}(\sqrt{N})$, yielding again a quadratic improvement
upon the oracle-free setting, where $\tilde{\Theta}(N)$ is known to be tight.


1 Introduction 

Prediction with expert advice is a fundamental model of sequential decision making and online 
learning in games. This setting is often described as the following repeated game between a player 
and an adversary: on each round, the player has to pick an expert from a fixed set of N possible 
experts, the adversary then reveals an arbitrary assignment of losses to the experts, and the player 
incurs the loss of the expert he chose to follow. The goal of the player is to minimize his T-round 
average regret, defined as the difference between his average loss over T rounds of the game and the 
average loss of the best expert in that period—the one having the smallest average loss in hindsight. 
Multiplicative weights algorithms (Littlestone and Warmuth, 1994; Freund and Schapire, 1997; see 
also Arora et al., 2012 for an overview) achieve this goal by maintaining weights over the experts
and choosing which expert to follow by sampling proportionally to the weights; the weights are 
updated from round to round via a multiplicative update rule according to the observed losses. 
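
To make the update rule concrete, the following is a minimal Python sketch of a multiplicative weights player. It rests on assumptions that are not taken from the text: the losses are supplied as explicit rows of length N, the learning rate `eta` is a free parameter, and the function name is illustrative. Note that each round touches all N weights, which is the linear per-round cost discussed next.

```python
import math
import random


def multiplicative_weights(loss_rows, eta, rng=None):
    """Run multiplicative weights over a sequence of loss vectors.

    loss_rows[t][i] is the loss in [0, 1] of expert i on round t
    (an illustrative, explicit representation of the adversary's choices).
    Returns the list of experts chosen by the player.
    """
    rng = rng or random.Random(0)
    n = len(loss_rows[0])
    weights = [1.0] * n
    choices = []
    for losses in loss_rows:
        # Sample an expert with probability proportional to its weight.
        choices.append(rng.choices(range(n), weights=weights)[0])
        # Multiplicative update: each weight shrinks with the observed loss.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        # Renormalize to avoid underflow; the sampling distribution is unchanged.
        total = sum(weights)
        weights = [w / total for w in weights]
    return choices
```

For instance, `multiplicative_weights([[0.0, 1.0, 1.0]] * 200, eta=1.0)` quickly concentrates on the first expert.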

While multiplicative weights algorithms are very general and provide particularly attractive
regret guarantees that scale with $\log N$, they need computation time that grows linearly with $N$ to
achieve meaningful average regret. The number of experts $N$ is often exponentially large in applications
(think of the number of all possible paths in a graph, or the number of different subsets of
a certain ground set), motivating the search for more structured settings where efficient algorithms
are possible. Assuming additional structure (such as linearity, convexity, or submodularity of the
loss functions), one can typically minimize regret in total $\mathrm{poly}(\log N)$ time in many settings of
interest (e.g., Zinkevich, 2003; Kalai and Vempala, 2005; Awerbuch and Kleinberg, 2008; Hazan and
Kale, 2012). However, the basic multiplicative weights algorithm remains the most general and is
still widely used.

The improvement in structured settings—most notably in the linear case (Kalai and Vempala, 
2005) and in the convex case (Zinkevich, 2003) —often comes from a specialized reduction of the 
online problem to the offline version of the optimization problem. In other words, efficient online 
learning is made possible by providing access to an offline optimization oracle over the experts,
that allows the player to quickly compute the best performing expert with respect to any given 
distribution over the adversary’s losses. However, in all of these cases, the regret and runtime 
guarantees of the reduction need the additional structure. Thus, it is natural to ask whether such 
a drastic improvement in runtime is possible for generic online learning. Specifically, we ask: What 
is the runtime required for minimizing regret given a black-box optimization oracle for the experts, 
without assuming any additional structure? Can one do better than linear time in N? 

In this paper, we give a precise answer to these questions. We show that, surprisingly, an offline 
optimization oracle gives rise to a substantial, quadratic improvement in the runtime required for 
convergence of the average regret. We give a new algorithm that is able to minimize regret in total 
time $\tilde{O}(\sqrt{N})$,¹ and provide a matching lower bound confirming that this is, in general, the best
possible. Thus, our results establish a tight characterization of the computational power of black-box
optimization in online learning. In particular, unlike in many of the structured settings where
$\mathrm{poly}(\log N)$ runtime is possible, without imposing additional structure a polynomial dependence
on N is inevitable. 

Our results demonstrate an exponential gap between the power of optimization in online learning
and its power in statistical learning. It is a simple and well-known fact that for a finite hypothesis
class of size $N$ (which corresponds to a set of $N$ experts in the online setting), black-box optimization
gives rise to a statistical learning algorithm, often called empirical risk minimization, that
needs only $O(\log N)$ examples for learning. Thus, given an offline optimization oracle that optimizes
in constant time, statistical learning can be performed in time $O(\log N)$; in contrast, our results
show that the complexity of online learning using such an optimization oracle is $\tilde{\Theta}(\sqrt{N})$. This
dramatic gap is surprising due to a long line of work in online learning suggesting that whatever
can be done in an offline setting can also be done (efficiently) online.

Finally, we study the implication of our results to repeated game playing in two-player zero-sum 
games. The analogue of an optimization oracle in this setting is a best-response oracle for each 
of the players, that allows her to quickly compute the pure action being the best-response to any 
given mixed strategy of her opponent. In this setting, we consider the problem of approximately 
solving a zero-sum game, namely finding a mixed strategy profile with payoff close to the minimax
payoff of the game. We show that our new online learning algorithm above, if deployed by each of
the players in an $N \times N$ zero-sum game, guarantees convergence to an approximate equilibrium in
total $\tilde{O}(\sqrt{N})$ time. This is, again, a quadratic improvement upon the best possible $\tilde{\Theta}(N)$ runtime
in the oracle-free setting, as established by Grigoriadis and Khachiyan (1995) and Freund and
Schapire (1999). Interestingly, it turns out that the quadratic improvement is tight for solving
zero-sum games as well: we prove that any algorithm would require $\tilde{\Omega}(\sqrt{N})$ time to approximate
the value of a zero-sum game in general, even when given access to powerful best-response oracles.

1.1 Related Work 

Online-to-offline reductions. The most general reduction from regret minimization to optimization
was introduced in the influential work of Kalai and Vempala (2005) as the Follow-the-Perturbed
Leader (FPL) methodology. This technique requires the problem at hand to be embeddable
in a low-dimensional space and the cost functions to be linear in that space.² Subsequently,
Kakade et al. (2009) reduced regret minimization to approximate linear optimization. For general
convex functions, the Follow-the-Regularized-Leader (FTRL) framework (Zinkevich, 2003; see
also Hazan, 2014) provides a general reduction from online to offline optimization, that often gives
dimension-independent convergence rates. Another general reduction was suggested by Kakade and
Kalai (2006) for the related model of transductive online learning, where future data is partially
available to the player (in the form of unlabeled examples).

¹ Here and throughout, we use the $\tilde{O}(\cdot)$ notation to hide constants and poly-logarithmic factors.

Without a fully generic reduction from online learning to optimization, specialized online
variants for numerous optimization scenarios have been explored. This includes efficient
regret-minimization algorithms for online variance minimization (Warmuth and Kuzmin, 2006), routing
in networks (Awerbuch and Kleinberg, 2008), online permutations and ranking (Helmbold and
Warmuth, 2009), online planning (Even-Dar et al., 2009), matrix completion (Hazan et al., 2012),
online submodular minimization (Hazan and Kale, 2012), contextual bandits (Dudík et al., 2011;
Agarwal et al., 2014), and many more.

Computational tradeoffs in learning. Tradeoffs between sample complexity and computation
in statistical learning have been studied intensively in recent years (e.g., Agarwal, 2012;
Shalev-Shwartz and Srebro, 2008; Shalev-Shwartz et al., 2012). However, the adversarial setting of online
learning, which is our main focus in this paper, has not received similar attention. One notable
exception is the seminal paper of Blum (1990) who showed that, under certain cryptographic
assumptions, there exists a hypothesis class which is computationally hard to learn in the online
mistake bound model but is non-properly learnable in polynomial time in the PAC model.³ In our
terminology, Blum’s result shows that online learning might require $\omega(\mathrm{poly}(\log N))$ time, even in
a case where offline optimization can be performed in $\mathrm{poly}(\log N)$ time, albeit non-properly (i.e.,
the optimization oracle is allowed to return a prediction rule which is not necessarily one of the $N$
experts).

Solution of zero-sum games. The computation of equilibria in zero-sum games is known to
be equivalent to linear programming, as was first observed by von Neumann (Adler, 2013). A
basic and well-studied question in game theory is the study of rational strategies that converge
to equilibria (see Nisan et al., 2007 for an overview). Freund and Schapire (1999) showed that in
zero-sum games, no-regret algorithms converge to equilibrium. Hart and Mas-Colell (2000) studied
convergence of no-regret algorithms to correlated equilibria in more general games; Even-Dar et al.
(2009) analyzed convergence to equilibria in concave games. Grigoriadis and Khachiyan (1995)
were the first to observe that zero-sum games can be solved in total time sublinear in the size of
the game matrix.

Game dynamics that rely on best-response computations have been a topic of extensive research 
for more than half a century, since the early days of game theory. Within this line of work, perhaps 
the most prominent dynamic is the “fictitious play” algorithm, in which both players repeatedly 
follow their best-response to the empirical distribution of their opponent’s past plays. This simple 
and natural dynamic was first proposed by Brown (1951), shown to converge to equilibrium in two- 
player zero-sum games by Robinson (1951), and was extensively studied ever since (see e.g., Brandt 
et al., 2013; Daskalakis and Pan, 2014 and the references therein). Another related dynamic, put
forth by Hannan (1957) and popularized by Kalai and Vempala (2005), is based on perturbed (i.e.,
noisy) best-responses.

² The extension to convex cost functions is straightforward (see, e.g., Hazan, 2014).

³ Non-proper learning means that the algorithm is allowed to return a hypothesis outside of the hypothesis class
it competes with.

We remark that since the early works of Grigoriadis and Khachiyan (1995) and Freund and
Schapire (1999), faster algorithms for approximating equilibria in zero-sum games have been proposed
(e.g., Nesterov, 2005; Daskalakis et al., 2011). However, the improvements there are in terms
of the approximation parameter $\varepsilon$ rather than the size of the game $N$. It is a simple folklore
fact that using only value oracle access to the game matrix, any algorithm for approximating the
equilibrium must run in time $\tilde{\Omega}(N)$; see, e.g., Clarkson et al. (2012).


2 Formal Setup and Statement of Results 

We now formalize our computational oracle-based model for learning in games—a setting which we 
call “Optimizable Experts”. The model is essentially the classic online learning model of prediction 
with expert advice augmented with an offline optimization oracle. 

Prediction with expert advice can be described as a repeated game between a player and an
adversary, characterized by a finite set $\mathcal{X}$ of $N$ experts for the player to choose from, a set $\mathcal{Y}$ of
actions for the adversary, and a loss function $\ell : \mathcal{X} \times \mathcal{Y} \to [0,1]$. First, before the game begins, the
adversary picks an arbitrary sequence $y_1, y_2, \ldots$ of actions from $\mathcal{Y}$.⁴ On each round $t = 1, 2, \ldots$
of the game, the player has to choose (possibly at random) an expert $x_t \in \mathcal{X}$, the adversary then
reveals his action $y_t \in \mathcal{Y}$ and the player incurs the loss $\ell(x_t, y_t)$. The goal of the player is to
minimize his expected average regret over $T$ rounds of the game, defined as

$$R(T) = \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} \ell(x_t, y_t) \right] - \min_{x \in \mathcal{X}} \frac{1}{T} \sum_{t=1}^{T} \ell(x, y_t) .$$

Here, the expectation is taken with respect to the randomness in the choices of the player.
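
As a small illustration of this definition, the following Python sketch evaluates the average regret of one realized sequence of plays. It is only a sketch under stated assumptions: the loss is passed as a hypothetical callable `loss(x, y)`, the expert set is an explicit iterable, and the expectation over the player's randomness is not taken (a single realization is scored).

```python
def average_regret(loss, xs, ys, experts):
    """Average regret of the played experts xs against adversary actions ys.

    loss(x, y) returns a value in [0, 1]; experts is the finite expert set.
    """
    T = len(ys)
    # Average loss actually incurred by the player.
    player = sum(loss(x, y) for x, y in zip(xs, ys)) / T
    # Average loss of the best fixed expert in hindsight.
    best = min(sum(loss(x, y) for y in ys) for x in experts) / T
    return player - best
```

With two experts and a loss that is 0 when the expert matches the action and 1 otherwise, playing expert 1 throughout against the actions 0, 0, 1, 0 gives an average regret of 1/2.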

In the optimizable experts model, we assume that the loss function $\ell$ is initially unknown to
the player, and allow her to access $\ell$ by means of two oracles: Val and Opt. The first oracle simply
computes for each pair of actions $(x, y)$ the respective loss $\ell(x, y)$ incurred by expert $x$ when the
adversary plays the action $y$.

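Definition (value oracle). A value oracle is a procedure $\mathsf{Val} : \mathcal{X} \times \mathcal{Y} \to [0,1]$ that for any action
pair $x \in \mathcal{X}$, $y \in \mathcal{Y}$, returns the loss value $\ell(x, y)$ in time $O(1)$; that is,

$$\forall\, x \in \mathcal{X},\ y \in \mathcal{Y} , \qquad \mathsf{Val}(x, y) = \ell(x, y) .$$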

The second oracle is far more powerful, and allows the player to quickly compute the best 
performing expert with respect to any given distribution over actions from $\mathcal{Y}$ (i.e., any mixed
strategy of the adversary). 

Definition (optimization oracle). An optimization oracle is a procedure Opt that receives as input
a distribution $q \in \Delta(\mathcal{Y})$, represented as a list of atoms $\{(i, q_i) : q_i > 0\}$, and returns a best
performing expert with respect to $q$ (with ties broken arbitrarily), namely

$$\forall\, q \in \Delta(\mathcal{Y}) , \qquad \mathsf{Opt}(q) \in \arg\min_{x \in \mathcal{X}} \mathbb{E}_{y \sim q}[\ell(x, y)] .$$

The oracle Opt runs in time $O(1)$ on any input.
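
The interface of the two oracles can be sketched in Python as follows. This is a brute-force stand-in under assumptions of our own: the loss is stored as an explicit matrix, the class and method names are hypothetical, and `opt` scans all experts, so it does not run in constant time as the model postulates. It only illustrates the input and output conventions, in particular the sparse list-of-atoms representation of $q$.

```python
from typing import Dict, List


class MatrixOracles:
    """Brute-force stand-ins for Val and Opt over an explicit loss matrix."""

    def __init__(self, loss_matrix: List[List[float]]):
        # loss_matrix[x][y] is the loss of expert x against adversary action y.
        self.loss_matrix = loss_matrix

    def val(self, x: int, y: int) -> float:
        """Value oracle: the loss of expert x on action y."""
        return self.loss_matrix[x][y]

    def opt(self, q: Dict[int, float]) -> int:
        """Optimization oracle: an expert minimizing the expected loss under q.

        q maps each action with positive probability to that probability.
        """
        def expected_loss(x: int) -> float:
            return sum(p * self.loss_matrix[x][y] for y, p in q.items())

        # Ties are broken in favor of the lowest index.
        return min(range(len(self.loss_matrix)), key=expected_loss)
```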

⁴ Such an adversary is called oblivious, since it cannot react to the decisions of the player as the game progresses.
We henceforth assume an oblivious adversary, and relax this assumption later in Section 4.

Recall that our goal in this paper is to evaluate online algorithms by their runtime complexity.
To this end, it is natural to consider the running time it takes for the average regret of the player
to drop below some specified target threshold.⁵ Namely, for a given $\varepsilon > 0$, we will be interested in
the total computational cost (as opposed to the number of rounds) required for the player to ensure
that $R(T) < \varepsilon$, as a function of $N$ and $\varepsilon$. Notice that the number of rounds $T$ required to meet the
latter goal is implicit in this view, and only indirectly affects the total runtime.

2.1 Main Results 

We can now state the main results of the paper: a tight characterization of the runtime required
for the player to converge to $\varepsilon$ expected average regret in the optimizable experts model.

Theorem 1. In the optimizable experts model, there exists an algorithm that for any $\varepsilon > 0$,
guarantees an expected average regret of at most $\varepsilon$ with total runtime of $\tilde{O}(\sqrt{N}/\varepsilon^2)$. Specifically,
Algorithm 2 (see Section 3.2) achieves $\tilde{O}(N^{1/4}/\sqrt{T})$ expected average regret over $T$ rounds, and
runs in $\tilde{O}(1)$ time per round.

The dependence on the number of experts N in the above result is tight, as the following 
theorem shows. 

Theorem 2. Any (randomized) algorithm in the optimizable experts model cannot guarantee an 
expected average regret smaller than ^ in total time better than $\tilde{O}(\sqrt{N})$.

In other words, we exhibit a quadratic improvement in the total runtime required for the 
average regret to converge, as compared to standard multiplicative weights schemes that require 
$\tilde{O}(N/\varepsilon^2)$ time, and this improvement is the best possible. Granted, the regret bound attained by the
algorithm is inferior to those achieved by multiplicative weights methods, that depend on N only 
logarithmically; however, when we consider the total computational cost required for convergence, 
the substantial improvement is evident. 
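
The comparison can be made explicit by solving each regret bound for the number of rounds. The short derivation below is a sketch that suppresses logarithmic factors and uses the standard $O(\sqrt{\log N / T})$ average regret and $O(N)$ per-round cost of multiplicative weights; these two figures are the usual textbook ones and are assumed here rather than quoted from the theorems above.

```latex
\begin{align*}
\text{oracle model:}\quad
  & \frac{N^{1/4}}{\sqrt{T}} \le \varepsilon
    \iff T \ge \frac{\sqrt{N}}{\varepsilon^{2}},
  & \text{total time} &= T \cdot \tilde{O}(1)
    = \tilde{O}\!\left(\frac{\sqrt{N}}{\varepsilon^{2}}\right), \\
\text{multiplicative weights:}\quad
  & \sqrt{\frac{\log N}{T}} \le \varepsilon
    \iff T \ge \frac{\log N}{\varepsilon^{2}},
  & \text{total time} &= T \cdot O(N)
    = \tilde{O}\!\left(\frac{N}{\varepsilon^{2}}\right).
\end{align*}
```

The oracle-based algorithm thus needs more rounds but far less work per round, which is where the quadratic saving in $N$ comes from.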

Our upper bound actually applies to a model more general than the optimizable experts model, 
where instead of having access to an optimization oracle, the player receives information about the 
leading expert on each round of the game. Namely, in this model the player observes at the end of
round $t$ the leader

$$x_t^* = \arg\min_{x \in \mathcal{X}} \sum_{s=1}^{t} \ell(x, y_s) \tag{1}$$

as part of the feedback. This is indeed a more general model, as the leader can be computed
in the oracle model in amortized $O(1)$ time, simply by calling $\mathsf{Opt}(y_1, \ldots, y_t)$. (The list of actions
$y_1, \ldots, y_t$ played by the adversary can be maintained in an online fashion in $O(1)$ time per round.)
Our lower bound, however, applies even when the player has access to an optimization oracle in its 
full power. 

Finally, we mention a simple corollary of Theorem 2: we obtain that the time required to attain 
vanishing average regret in online Lipschitz-continuous optimization in Euclidean space is exponential
in the dimension, even when an oracle for the corresponding offline optimization problem
is at hand. For the precise statement of this result, see Section 5.1. 

⁵ This is indeed the appropriate criterion in algorithmic applications of online learning methods.

2.2 Zero-sum Games with Best-response Oracles

In this section we present the implications of our results for repeated game playing in two-player 
zero-sum games. Before we can state the results, we first recall the basic notions of zero-sum games 
and describe the setting formally. 

A two-player zero-sum game is specified by a matrix $G \in [0,1]^{N \times N}$, in which the rows correspond
to the (pure) strategies of the first player, called the row player, while the columns correspond to
strategies of the second player, called the column player. For simplicity, we restrict the attention
to games in which both players have $N$ pure strategies to choose from; our results below can be
readily extended to deal with games of general (finite) size. A mixed strategy of the row player is
a distribution $p \in \Delta_N$ over the rows of $G$; similarly, a mixed strategy for the column player is a
distribution $q \in \Delta_N$ over the columns. For players playing strategies $(p, q)$, the loss (respectively
payoff) suffered by the row (respectively column) player is given by $p^\top G q$. A pair of mixed strategies
$(p, q)$ is said to be an approximate equilibrium, if for both players there is almost no incentive in
deviating from the strategies $p$ and $q$. Formally, $(p, q)$ is an $\varepsilon$-equilibrium if and only if

$$\forall\, 1 \le i, j \le N , \qquad p^\top G e_j - \varepsilon \le p^\top G q \le e_i^\top G q + \varepsilon .$$

Here and throughout, $e_i$ stands for the $i$'th standard basis vector, namely a vector with 1 in its $i$'th
coordinate and zeros elsewhere. The celebrated von Neumann minimax theorem asserts that for
any zero-sum game there exists an exact equilibrium (i.e., with $\varepsilon = 0$) and it has a unique value,
given by

$$\lambda(G) = \min_{p \in \Delta_N} \max_{q \in \Delta_N} p^\top G q .$$
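
The $\varepsilon$-equilibrium condition can be checked directly from its definition. The following Python sketch does so for a game given as an explicit matrix; the function name and the dense list-of-lists representation are illustrative assumptions, and the check reads the whole matrix, so it is not meant to reflect the oracle-based cost model.

```python
def is_eps_equilibrium(G, p, q, eps):
    """Check whether the mixed strategies (p, q) form an eps-equilibrium of G.

    G[i][j] is the loss of the row player (payoff of the column player)
    for the pure strategy profile (i, j).
    """
    n_rows, n_cols = len(G), len(G[0])
    # e_i^T G q: loss of each pure row i against the mixed strategy q.
    row_vals = [sum(G[i][j] * q[j] for j in range(n_cols)) for i in range(n_rows)]
    # p^T G e_j: payoff of each pure column j against the mixed strategy p.
    col_vals = [sum(p[i] * G[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # p^T G q: the loss under the profile (p, q).
    value = sum(p[i] * row_vals[i] for i in range(n_rows))
    return max(col_vals) - eps <= value <= min(row_vals) + eps
```

In matching pennies, `G = [[1, 0], [0, 1]]`, the uniform profile passes the check for any small `eps`, whereas a row player committed to the first row does not.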

A repeated zero-sum game is an iterative process in which the two players simultaneously 
announce their strategies, and suffer loss (or receive payoff) accordingly. Given $\varepsilon > 0$, the goal of
the players in the repeated game is to converge, as quickly as possible, to an $\varepsilon$-equilibrium; in this
paper, we will be interested in the total runtime required for the players to reach an $\varepsilon$-equilibrium,
rather than the total number of game rounds required to do so. 

We assume that the players do not know the game matrix G in advance, and may only access 
it through two types of oracles, which are very similar to the ones we defined in the online learning 
model. The first and most natural oracle allows the player to query the payoff for any pair of pure 
strategies (i.e., a pure strategy profile) in constant time. Formally, 

Definition (value oracle). A value oracle for a zero-sum game described by a matrix $G \in [0,1]^{N \times N}$
is a procedure Val that accepts row and column indices $i, j$ as input and returns the game value for
the pure strategy profile $(i, j)$, namely:

$$\forall\, 1 \le i, j \le N , \qquad \mathsf{Val}(i, j) = G_{i,j} .$$

The value oracle runs in time $O(1)$ on any valid input.

The other oracle we consider is the analogue of an optimization oracle in the context of games. 
For each of the players, a best-response oracle is a procedure that computes the player’s best 
response (pure) strategy to any mixed strategy of his opponent, given as input. 

Definition (best-response oracle). A best-response oracle for the row player in a zero-sum game
described by a matrix $G \in [0,1]^{N \times N}$ is a procedure $\mathsf{BR}_1$ that receives as input a distribution
$q \in \Delta_N$, represented as a list of atoms $\{(i, q_i) : q_i > 0\}$, and computes

$$\forall\, q \in \Delta_N , \qquad \mathsf{BR}_1(q) \in \arg\min_{1 \le i \le N} e_i^\top G q$$

with ties broken arbitrarily. Similarly, a best-response oracle $\mathsf{BR}_2$ for the column player accepts as
input a $p \in \Delta_N$ represented as a list $\{(i, p_i) : p_i > 0\}$, and computes

$$\forall\, p \in \Delta_N , \qquad \mathsf{BR}_2(p) \in \arg\max_{1 \le j \le N} p^\top G e_j .$$

Both best-response oracles return in time $O(1)$ on any input.

Our main results regarding the runtime required to converge to an approximate equilibrium in 
zero-sum games with best-response oracles, are the following. 

Theorem 3. There exists an algorithm (see Algorithm 6 in Section 4) that for any zero-sum game
with $[0,1]$ payoffs and for any $\varepsilon > 0$, terminates in time $\tilde{O}(\sqrt{N}/\varepsilon^2)$ and outputs with high probability
an $\varepsilon$-approximate equilibrium.

Theorem 4. Any (randomized) algorithm for approximating the equilibrium of N x N zero-sum 
games with best-response oracles cannot guarantee with probability greater than | that the average
payoff of the row player is at most \-away from its value at equilibrium in total time better
than $\tilde{O}(\sqrt{N})$.

As indicated earlier, these results show that best-response oracles in repeated game playing 
give rise again to a quadratic improvement in the runtime required for solving zero-sum games, as 
compared to the best possible runtime to do so without access to best-response oracles, which
scales linearly with N (Grigoriadis and Khachiyan, 1995; Freund and Schapire, 1999). 

The algorithm deployed in Theorem 3 above is a very natural one: it simulates a repeated game 
where both players play a slight modification of the regret minimization algorithm of Theorem 1, 
and the best-response oracle of each player serves as the optimization oracle required for the online 
algorithm; see Section 4 for more details. 

2.3 Overview of the Approach and Techniques 

We now outline the main ideas leading to the quadratic improvement in runtime achieved by our 
online algorithm of Theorem 1. Intuitively, the challenge is to reduce the number of “effective” 
experts quadratically, from $N$ to roughly $\sqrt{N}$. Since we have an optimization oracle at our disposal,
it is natural to focus on the set of “leaders”—those experts that have been best at some point 
in history—and try to reduce the complexity of the online problem to scale with the number 
of such leaders. This set is natural considering our computational concerns: the algorithm can 
obtain information on the leaders at almost no cost (using the optimization oracle, it can compute 
the leader on each round in only $O(1)$ time per round), resulting in a potentially substantial
advantage in terms of runtime. 

First, suppose that there is a small number of leaders throughout the game, say $L = O(\sqrt{N})$.
Then, intuitively, the problem we face is easy: if we knew the identity of those leaders in advance,
our regret would scale with $L$ and be independent of the total number of experts $N$. As a result,
using standard multiplicative weights techniques we would be able to attain vanishing regret in total
time that depends linearly on $L$, and in case $L = O(\sqrt{N})$ we would be done. When the leaders
are not known in advance, one could appeal to various techniques that were designed to deal with
experts problems in which the set of experts evolves over time (e.g., Freund et al., 1997; Blum and
Mansour, 2007; Kleinberg et al., 2010; Gofer et al., 2013). However, the per-round runtime of all
of these methods is linear in $L$, which is prohibitive for our purposes. We remark that the simple
“follow the leader” algorithm, that simply chooses the most recent leader on each round of the
game, is not guaranteed to perform well in this case: the regret of this algorithm scales with the
number of times the leader switches (rather than the number of distinct leaders), which might grow
linearly with $T$ even when there are few active leaders.
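
The failure of “follow the leader” can be seen on a classical two-expert example, sketched below in Python. The construction is a textbook one chosen for illustration and is not taken from this paper; the function names and the explicit loss rows are assumptions. Only two distinct leaders ever appear, yet the leader switches on almost every round and the algorithm pays a loss of 1 each time.

```python
def follow_the_leader(loss_rows):
    """Play the expert with the smallest cumulative loss so far.

    Ties are broken in favor of the lowest index.
    """
    n = len(loss_rows[0])
    cumulative = [0.0] * n
    choices = []
    for losses in loss_rows:
        choices.append(min(range(n), key=lambda i: cumulative[i]))
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return choices


def alternating_losses(T):
    """Two experts: the first round breaks the tie, then losses alternate."""
    rows = [[0.5, 0.0]]
    for t in range(1, T):
        # On every round the current leader is the one that suffers loss 1.
        rows.append([0.0, 1.0] if t % 2 == 1 else [1.0, 0.0])
    return rows
```

On `alternating_losses(101)` the algorithm accumulates a total loss of 100.5, while the better of the two experts loses only 50, so the average regret stays near 1/2 no matter how long the game runs.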

A main component in our approach is a novel online learning algorithm, called Leaders, that
keeps track of the leaders in an online game, and attains $\tilde{O}(\sqrt{L/T})$ average regret in expectation
with $\tilde{O}(1)$ runtime per round. The algorithm, that we describe in detail in Section 3.1, queries the
oracles only $\tilde{O}(1)$ times per iteration and thus can be implemented efficiently. More formally,

Theorem 5. The expected $T$-round average regret of the Leaders algorithm is upper bounded by
$O(\sqrt{(L/T) \log(LT)})$, where $L$ is an upper bound over the total number of distinct leaders
throughout the game. The algorithm can be implemented in $\tilde{O}(1)$ time per round in the optimizable
experts model.

As far as we know, this technique is new to the theory of regret minimization and may be of 
independent interest. In a sense, it is a partial-information algorithm: it is allowed to use only a 
small fraction of the feedback signal (i.e., read a small fraction of the loss values) on each round, 
due to the time restrictions. Nevertheless, its regret guarantee can be shown to be optimal in 
terms of the number of leaders L, even when removing the computational constraints! The new 
algorithm is based on running in parallel a hierarchy of multiplicative-updates algorithms with 
varying look-back windows for keeping track of recent leaders. 

But what happens if there are many leaders, say $L = \Omega(\sqrt{N})$? In this case, we can incorporate
random guessing: if we sample about $\sqrt{N}$ experts, with constant probability one of them would be
among the “top” $\sqrt{N}$ leaders. By competing with this small random set of experts, we can keep
the regret under control, up to the point in time where at most $\sqrt{N}$ leaders remain active (in the
sense that they appear as leaders at some later time). In essence, this observation allows us to
reduce the effective number of leaders back to the order of $\sqrt{N}$ and use the approach detailed above
even when $L = \Omega(\sqrt{N})$, putting the Leaders algorithm into action at the point in time where the
top $\sqrt{N}$ leader is encountered (without actually knowing when exactly this event occurs).
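
The probability estimate behind the random guessing step can be spelled out as follows. The calculation is a sketch under simplifying assumptions that are ours: the $\sqrt{N}$ experts are drawn uniformly and independently (with replacement) from all $N$ experts, and the “top” $\sqrt{N}$ leaders are $\sqrt{N}$ distinct experts.

```latex
\begin{align*}
\Pr\bigl[\text{no sampled expert is among the top } \sqrt{N} \text{ leaders}\bigr]
  &= \left(1 - \frac{\sqrt{N}}{N}\right)^{\sqrt{N}}
   = \left(1 - \frac{1}{\sqrt{N}}\right)^{\sqrt{N}}
   \le e^{-1} .
\end{align*}
```

Hence at least one sampled expert lands among these leaders with probability at least $1 - 1/e \approx 0.63$, using the inequality $1 - x \le e^{-x}$.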

In order to apply our algorithm to repeated two-player zero-sum games and obtain Theorem 3,
we first show how it can be adapted to minimize regret even when used against an adaptive adversary,
that can react to the decisions of the algorithm (as is the case in repeated games). Then, via
standard techniques (Freund and Schapire, 1999), we show that the quadratic speedup we achieved
in the online learning setting translates to similar speedup in the solution of zero-sum games. In a
nutshell, we let both players use our online regret-minimization algorithm for picking their strategies
on each round of the game, where they use their best-response oracles to fill the role of the
optimization oracle in the optimizable experts model.
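
The self-play scheme can be illustrated in Python with a deliberately simpler stand-in: both players run plain multiplicative weights on expected losses, which is the oracle-free baseline rather than the algorithm of this paper. The function names, the dense matrix representation, and the choice of learning rate are assumptions made for the sketch. The average strategies of the two players then form an approximate equilibrium, with a duality gap bounded by the sum of the players' average regrets.

```python
import math


def _softmax(scores):
    """Normalize exponentiated scores, shifting by the maximum for stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mw_self_play(G, T):
    """Both players run multiplicative weights against each other for T rounds.

    G[i][j] is the row player's loss and the column player's payoff.
    Returns the average mixed strategies (p_bar, q_bar).
    """
    n, m = len(G), len(G[0])
    eta = math.sqrt(8.0 * math.log(max(n, m)) / T)
    row_loss = [0.0] * n  # cumulative expected loss of each pure row
    col_gain = [0.0] * m  # cumulative expected payoff of each pure column
    p_bar, q_bar = [0.0] * n, [0.0] * m
    for _ in range(T):
        # The row player favors rows with small loss, the column player
        # favors columns with large payoff.
        p = _softmax([-eta * c for c in row_loss])
        q = _softmax([eta * c for c in col_gain])
        for i in range(n):
            row_loss[i] += sum(G[i][j] * q[j] for j in range(m))
            p_bar[i] += p[i] / T
        for j in range(m):
            col_gain[j] += sum(p[i] * G[i][j] for i in range(n))
            q_bar[j] += q[j] / T
    return p_bar, q_bar


def duality_gap(G, p, q):
    """max_j p^T G e_j - min_i e_i^T G q; it is zero exactly at an equilibrium."""
    n, m = len(G), len(G[0])
    best_col = max(sum(p[i] * G[i][j] for i in range(n)) for j in range(m))
    best_row = min(sum(G[i][j] * q[j] for j in range(m)) for i in range(n))
    return best_col - best_row
```

Each round of this baseline reads the entire matrix; the point of the oracle-based approach is to avoid exactly that cost.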

Our lower bounds (i.e., Theorems 2 and 4) are based on information-theoretic arguments, which
can be turned into running time lower bounds in our oracle-based computational model. In particular,
the lower bound for zero-sum games is based on a reduction to a problem investigated
by Aldous (1983) and revisited years later by Aaronson (2006), and reveals interesting connections
between the solution of zero-sum games and local-search problems. Aldous investigated the hardness
of local-search problems and gave an explicit example of an efficiently-representable (random)
function which is hard to minimize over its domain, even with access to a local improvement oracle.
(A local improvement oracle improves upon a given solution by searching in its local neighborhood.)
Our reduction constructs a zero-sum game in which a best-response query amounts to a
local-improvement step, and translates Aldous’ query-complexity lower bound to a runtime lower
bound in our model.

Interestingly, the connection to local-search problems is also visible in our algorithmic results:
our algorithm for learning with optimizable experts (Algorithm 2) involves guessing a “top $\sqrt{N}$”
solution (i.e., a leader) and making $\sqrt{N}$ local-improvement steps to this solution (i.e., tracking
the finalist leaders all the way to the final leader). This is reminiscent of a classical randomized
algorithm for local-search, pointed out by Aldous (1983).

3 Algorithms for Optimizable Experts 

In this section we develop our algorithms for online learning in the optimizable experts model. 
Recall that we assume a more general setting where there is no optimization oracle, but instead the 
player observes after each round t the identity of the leader (see Eq. (1)) as part of the feedback 
on that round. Thus, in what follows we assume that the leader is known immediately after 
round t with no additional computational costs, and do not require the oracle Opt any further. 

To simplify the presentation, we introduce the following notation. We fix a horizon $T > 0$ and
denote by $f_1, \ldots, f_T$ the sequence of loss functions induced by the actions $y_1, \ldots, y_T$ chosen by the
adversary, where $f_t(\cdot) = \ell(\cdot, y_t)$ for all $t$; notice that the resulting sequence $f_1, \ldots, f_T$ is a completely
arbitrary sequence of loss functions over $\mathcal{X}$, as both $\ell$ and the $y_t$'s are chosen adversarially. We also
fix the set of experts to $\mathcal{X} = [N] = \{1, \ldots, N\}$, identifying each expert with its serial index.

3.1 The Leaders Algorithm 

We begin by describing the main technique in our algorithmic results, the Leaders algorithm,
which is key to proving Theorem 1. Leaders is an online algorithm designed to perform well
in online learning problems with a small number of leaders, both in terms of average regret and
computational costs. The algorithm makes use of the information on the leaders $x_1^*, x_2^*, \ldots$ received
as feedback to save computation time, and can be made to run in almost constant time per round
(up to logarithmic factors).


Parameters: L, T 

1. Set $\eta_0 = \sqrt{\log(2LT)}$ and $\nu = 2\eta_0 \sqrt{L/T}$

2. For all r = 1,..., [log 2 T] and s = 1,..., [log 2 T], initialize an instance Ar^s of 
MW^(A:,r 7 , 7 ) with fe = 2 *, 7 = rjo/V 2 *+l and 7=7 

3. Initialize an instance A of MW^(i^, 7 ^) algorithm on the Ar,s’s as experts 

4. For t = 1,2,...: 

(a) Play the prediction $x_t$ of the algorithm chosen by $A$

(b) Observe feedback $f_t$ and the leader $x_t^*$, and update all algorithms $A_{r,s}$

Algorithm 1: The Leaders algorithm. 


The Leaders algorithm is presented in Algorithm 1. In the following theorem we state its 
guarantees; the theorem gives a slightly more general statement than the one presented earlier in 
Theorem 5, that we require for the proof of our main result. 

Theorem 6. Assume that Leaders is used for prediction with expert advice (with leaders feedback)
against loss functions $f_1, f_2, \ldots : [N] \to [0,1]$, and that the total number of distinct leaders during
a certain time period $t_0 < t \le t_1$ whose length is bounded by $T$, is at most $L$. Then, provided the
numbers $L$ and $T$ are given as input, the algorithm obtains the following regret guarantee:

\[
\mathbb{E}\left[ \sum_{t=t_0+1}^{t_1} f_t(x_t) \right]
- \left( \sum_{t=1}^{t_1} f_t(x^*_{t_1}) - \sum_{t=1}^{t_0} f_t(x^*_{t_0}) \right)
\le 25\sqrt{LT\log(2LT)} .
\]

The algorithm can be implemented to run in $O(\log^2(L)\log(T))$ time per round.

Algorithm 1 relies on two simpler online algorithms, MW$^1$ and MW$^3$, that we
describe in detail later on in this section (see Section 3.3, where we also discuss an algorithm called
MW$^2$). These algorithms are variants of the standard multiplicative weights (MW) method
for prediction with expert advice. MW$^1$ is a rather simple adaptation of MW which is able to
guarantee bounded regret in any time interval of predefined length:

Lemma 11. Suppose that MW$^1$ (Algorithm 3 below) is used for prediction with expert advice,
against an arbitrary sequence of loss functions $f_1, f_2, \ldots : [N] \to [0,1]$ over $N$ experts. Then, for
$\gamma = 1/T$ and any $\eta > 0$, its sequence of predictions $x_1, x_2, \ldots$ satisfies

\[
\mathbb{E}\left[ \sum_{t=t_0}^{t_1} f_t(x_t) \right] - \min_{x \in [N]} \sum_{t=t_0}^{t_1} f_t(x)
\le \frac{2\log(NT)}{\eta} + \eta T
\]

in any time interval $\{t_0, \ldots, t_1\}$ of length at most $T$. The algorithm can be implemented to run in
$O(N)$ time per round.

The MW$^3$ algorithm is a “sliding window” version of MW$^2$ that, given a parameter $k > 0$,
maintains a buffer of $k$ experts that were recently “activated”; in our context, an expert is activated
on round $t$ if it is the leader at the end of that round. MW$^3$ competes (in terms of regret) with
the $k$ most recent activated experts as long as they remain in the buffer. Formally,

Lemma 13. Suppose that MW$^3$ (Algorithm 5 below) is used for prediction with expert advice,
against an arbitrary sequence of loss functions $f_1, f_2, \ldots : [N] \to [0,1]$ over $N$ experts. Assume
that expert $x^* \in [N]$ was activated on round $t_0$, and from that point until round $t_1$ there were no
more than $k$ different activated experts (including $x^*$ itself). Then, for $\gamma = 1/T$ and any $\eta > 0$, the
predictions $x_1, x_2, \ldots$ of the algorithm satisfy

\[
\mathbb{E}\left[ \sum_{t=t_0'+1}^{t_1'} f_t(x_t) \right] - \sum_{t=t_0'+1}^{t_1'} f_t(x^*)
\le \frac{4\log(kT)}{\eta} + \eta k T
\]

in any time interval $[t_0', t_1'] \subseteq [t_0, t_1]$ of length at most $T$. Furthermore, the algorithm can be
implemented to run in time $\tilde{O}(1)$ per round.

For the analysis of Algorithm 1, we require a few definitions. We let $I = \{t_0+1, \ldots, t_1\}$ denote
the time interval under consideration. For all $t \in I$, we denote by $S_t = \{x^*_{t_0+1}, \ldots, x^*_t\}$ the set of
all leaders encountered since round $t_0+1$ up to and including round $t$; for completeness we also
define $S_{t_0} = \emptyset$. The theorem's assumption then implies that $|S_{t_0}| \le \cdots \le |S_{t_1}| \le L$. For a set of
experts $S \subseteq [N]$, we let $\tau(S) = \max\{t \in I : x^*_t \in S\}$ be the last round in which one of the experts
in $S$ occurs as a leader. In other words, after round $\tau(S)$, the leaders in $S$ have “died out” and no
longer appear as leaders.

Next, we split $I$ into epochs $I_1, I_2, \ldots$, where the $i$'th epoch $I_i = \{\tau_i+1, \ldots, \tau_{i+1}\}$ spans between
rounds $\tau_i+1$ and $\tau_{i+1}$, and $\tau_i$ is defined recursively by $\tau_1 = t_0$ and $\tau_{i+1} = \tau(S_{\tau_i+1})$ for all $i = 1, 2, \ldots$.
In words, $S_{\tau_i+1}$ is the set of leaders encountered by the beginning of epoch $i$, and this epoch ends
once all leaders in this set have died out. Let $m = \max\{i : \tau_i < t_1\}$ denote the number of resulting
epochs (notice that $m \le L$, as at least one leader dies out in each of the epochs). For each
$i = 1, \ldots, m$, let $T_i = |I_i|$ denote the length of the $i$'th epoch, and let $z^*_i = x^*_{\tau_{i+1}}$ be the leader
at the end of epoch $i$. Finally, for each epoch $i = 1, \ldots, m$ we let $C_i = S_{\tau_i+1} \setminus S_{\tau_{i-1}+1}$ (with the
convention $S_{\tau_0+1} = \emptyset$) denote the set of leaders that have died out during the epoch, and for
technical convenience we also define $C_0 = C_{m+1} = \emptyset$; notice that $C_1, \ldots, C_m$ is a partition of the
set of all leaders, so in particular $|C_1| + \cdots + |C_m| \le L$. See Fig. 1 for an illustration of the definitions.

Figure 1: An illustration of the key definitions in the analysis of Algorithm 1. Each expert is
represented by a horizontal segment, which signifies the time interval between the expert’s first and
last appearances as leader (the experts are sorted by their first time of appearance as leaders). The
resulting partition $C_1, \ldots, C_4$ of the experts and the induced epochs $I_1, \ldots, I_4$ are indicated by the
dotted lines.

Our first lemma states that minimizing regret in each epoch $i$ with respect to the leader $z^*_i$
at the end of the epoch also guarantees low regret with respect to the overall leader $x^*_{t_1}$. It is a
variant of the “Follow The Leader, Be The Leader” lemma (Kalai and Vempala, 2005).

Lemma 7. Following the epoch’s leader yields no regret, in the sense that

\[
\sum_{i=1}^{m} \sum_{t \in I_i} f_t(z^*_i) \le \sum_{t=1}^{t_1} f_t(x^*_{t_1}) - \sum_{t=1}^{t_0} f_t(x^*_{t_0}) .
\]

Proof. Let $I_0 = \{1, \ldots, t_0\}$ and $z^*_0 = x^*_{t_0}$. We will prove by induction on $m \ge 0$ that

\[
\sum_{i=0}^{m} \sum_{t \in I_i} f_t(z^*_i) \le \sum_{i=0}^{m} \sum_{t \in I_i} f_t(z^*_m) . \tag{2}
\]

This inequality would imply the lemma, as $z^*_m = x^*_{t_1}$. For $m = 0$, our claim is trivial as both sides
of Eq. (2) are equal. Now, assuming that Eq. (2) holds for $m - 1$, we have

\begin{align*}
\sum_{i=0}^{m-1} \sum_{t \in I_i} f_t(z^*_i)
&\le \sum_{i=0}^{m-1} \sum_{t \in I_i} f_t(z^*_{m-1}) && \text{(induction)} \\
&\le \sum_{i=0}^{m-1} \sum_{t \in I_i} f_t(z^*_m) ,
\end{align*}

since by definition $z^*_{m-1}$ performs better than any other expert, and in particular than $z^*_m$, throughout
the first $m - 1$ epochs. Adding the term $\sum_{t \in I_m} f_t(z^*_m)$ to both sides of the above inequality,
we obtain Eq. (2). □

Next, we identify a key property of our partition to epochs.

Lemma 8. For all epochs $i$, it holds that $z^*_i \in C_i$. In addition, any leader encountered during the
lifetime of $z^*_i$ as leader (i.e., between its first and last appearances in the sequence $x^*_{t_0+1}, \ldots, x^*_{t_1}$ of
leaders) must be a member of $C_{i-1} \cup C_i \cup C_{i+1}$.

Proof. Consider epoch $i$ and the leader $z^*_i$ at the end of this epoch. To see that $z^*_i \in C_i$, recall
that the $i$'th epoch ends right after the leaders in $C_i$ have all died out, so the leader at the end of
this epoch must be a member of the latter set. This also means that $z^*_i$ was first encountered not
before epoch $i - 1$ (in fact, even not on the first round of that epoch), and the last time it was a
leader was on the last round of epoch $i$ (see Fig. 1). In particular, throughout the lifetime of $z^*_i$ as
leader, only the experts in $C_{i-1} \cup C_i \cup C_{i+1}$ could have appeared as leaders. □
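
The epoch decomposition used in this analysis can be sketched in a few lines of code. The snippet below is only an illustration under simplifying assumptions: it takes $t_0 = 0$, represents the leader sequence as a Python list, and uses a hypothetical function name `split_into_epochs`. It follows the recursion $\tau_1 = t_0$, $\tau_{i+1} = \tau(S_{\tau_i+1})$ and reports, for each epoch, its first round, its last round, and the leader at its end.

```python
def split_into_epochs(leaders):
    """leaders[t-1] is the leader after round t, for t = 1..n (so t0 = 0).

    Returns a list of (first_round, last_round, end_leader) triples.
    """
    n = len(leaders)
    last_seen = {}
    for t, x in enumerate(leaders, start=1):
        last_seen[x] = t                      # last round in which x is the leader
    epochs = []
    tau = 0                                   # tau_1 = t0
    while tau < n:
        seen = set(leaders[: tau + 1])        # leaders met up to round tau + 1
        next_tau = max(last_seen[x] for x in seen)   # round by which they all died out
        epochs.append((tau + 1, next_tau, leaders[next_tau - 1]))
        tau = next_tau
    return epochs

print(split_into_epochs([1, 1, 2, 1, 2, 3, 3, 2, 4, 4]))
```

Rebuilding the set `seen` on every epoch makes this sketch quadratic in the worst case; it is meant to mirror the definitions, not the running time of the algorithm.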

We are now ready to analyze the regret in a certain epoch $i$ with respect to its leader $z^*_i$. To
this end, we define $k_i = |C_{i-1}| + |C_i| + |C_{i+1}|$ and consider the MW$^3$ instance $\mathcal{A}^{(i)} = \mathcal{A}_{r_i,s_i}$, where
$r_i = \lceil \log_2 k_i \rceil$ and $s_i = \lceil \log_2 T_i \rceil$ (note that $1 \le r_i \le \lceil \log_2 L \rceil$ and $1 \le s_i \le \lceil \log_2 T \rceil$). The following
lemma shows that the regret of the algorithm $\mathcal{A}^{(i)}$ in epoch $i$ can be bounded in terms of the
quantity $k_i$. Below, we use $x_t^{(i)}$ to denote the decision of $\mathcal{A}^{(i)}$ on round $t$.

Lemma 9. The cumulative expected regret of the algorithm $\mathcal{A}^{(i)}$ throughout epoch $i$, with respect
to the leader $z^*_i$ at the end of this epoch, satisfies

\[
\mathbb{E}\left[ \sum_{t \in I_i} f_t(x_t^{(i)}) \right] - \sum_{t \in I_i} f_t(z^*_i) \le 10\sqrt{k_i T_i \log(2LT)} .
\]

Proof. Recall that $\mathcal{A}^{(i)}$ has a buffer of size $q_i = 2^{r_i}$ and step size $\eta_i = \sqrt{\log(2LT)/2^{r_i+s_i}}$. Now, from
Lemma 8 we know that $z^*_i \in C_i$, which means that $z^*_i$ first appeared as leader either on or before the
first round of epoch $i$. Also, the same lemma states that the number of distinct leaders that were
encountered throughout the lifetime of $z^*_i$ (including $z^*_i$ itself) is at most $|C_{i-1} \cup C_i \cup C_{i+1}| = k_i \le q_i$,
namely no more than the size of the buffer. Hence, applying Lemma 13 to epoch $i$, we have

\[
\mathbb{E}\left[ \sum_{t \in I_i} f_t(x_t^{(i)}) \right] - \sum_{t \in I_i} f_t(z^*_i)
\le \frac{4\log(2LT)}{\eta_i} + \eta_i q_i T_i ,
\]

where we have used $q_i \le 2L$ and $T_i \le T$ to bound the logarithmic term. Now, note that $q_i \le 2k_i$ and
$\sqrt{\log(2LT)/4k_iT_i} \le \eta_i \le \sqrt{\log(2LT)/k_iT_i}$, which follow from $k_i \le 2^{r_i} \le 2k_i$ and $T_i \le 2^{s_i} \le 2T_i$.
Plugging into the above bound, we obtain the lemma. □
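
For readers who want the arithmetic behind the constant 10 spelled out, the following worked step is a sketch that uses only the two-sided bound on $\eta_i$ and the bound $q_i \le 2k_i$ quoted in the proof:

```latex
\begin{align*}
\frac{4\log(2LT)}{\eta_i}
  &\le 4\log(2LT)\sqrt{\frac{4k_iT_i}{\log(2LT)}}
   = 8\sqrt{k_iT_i\log(2LT)} , \\
\eta_i q_i T_i
  &\le \sqrt{\frac{\log(2LT)}{k_iT_i}} \cdot 2k_iT_i
   = 2\sqrt{k_iT_i\log(2LT)} ,
\end{align*}
```

and the two terms add up to $10\sqrt{k_iT_i\log(2LT)}$.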

Our final lemma analyzes the MW algorithm $\mathcal{A}$, and shows that it obtains low regret against
the algorithm $\mathcal{A}^{(i)}$ in epoch $i$.

Lemma 10. The difference between the expected cumulative loss of Algorithm 1 during epoch $i$,
and the expected cumulative loss of $\mathcal{A}^{(i)}$ during that epoch, is bounded as

\[
\mathbb{E}\left[ \sum_{t \in I_i} f_t(x_t) - \sum_{t \in I_i} f_t(x_t^{(i)}) \right]
\le \frac{4\log(2LT)}{\nu} + \nu T_i .
\]

Proof. The algorithm $\mathcal{A}$ is following MW$^1$ updates over $m = \lceil \log_2 L \rceil \cdot \lceil \log_2 T \rceil$ algorithms as
meta-experts. Thus, Lemma 11 gives

\[
\mathbb{E}\left[ \sum_{t \in I_i} f_t(x_t) - \sum_{t \in I_i} f_t(x_t^{(i)}) \right]
\le \frac{2\log(mT)}{\nu} + \nu T_i .
\]

Using $m \le 2LT$ to bound the logarithmic term gives the result. □


We now turn to prove the theorem. 


Proof of Theorem 6. First, regarding the running time of the algorithm, note that on each round
Algorithm 1 has to update $O(\log(L)\log(T))$ instances of MW$^3$, where each such update costs at most
$O(\log L)$ time according to Lemma 12. Hence, the overall runtime per round is $O(\log^2(L)\log(T))$.

We next analyze the expected regret of the algorithm. Summing the bounds of Lemmas 9
and 10 over epochs $i = 1, \ldots, m$ and adding that of Lemma 7, we can bound the expected regret
of Algorithm 1 as follows:

\[
\mathbb{E}\left[ \sum_{t=t_0+1}^{t_1} f_t(x_t) \right] + \sum_{t=1}^{t_0} f_t(x^*_{t_0}) - \sum_{t=1}^{t_1} f_t(x^*_{t_1})
\le 10\sqrt{\log(2LT)} \sum_{i=1}^{m} \sqrt{k_i T_i} + \frac{4L\log(2LT)}{\nu} + \nu T , \tag{3}
\]

where we have used $m \le L$ and $\sum_i T_i \le T$. In order to bound the sum on the right-hand side,
we first notice that $\sum_i k_i \le 3\sum_i |C_i| \le 3L$. Hence, using the Cauchy-Schwarz inequality we
get $\sum_i \sqrt{k_i T_i} \le \sqrt{\sum_i k_i} \cdot \sqrt{\sum_i T_i} \le \sqrt{3LT}$. Combining this with Eq. (3) and our choice of $\nu$, and
rearranging the left-hand side of the inequality, we obtain

\[
\mathbb{E}\left[ \sum_{t=t_0+1}^{t_1} f_t(x_t) \right]
- \left( \sum_{t=1}^{t_1} f_t(x^*_{t_1}) - \sum_{t=1}^{t_0} f_t(x^*_{t_0}) \right)
\le 25\sqrt{LT\log(2LT)} ,
\]

and the theorem follows. □
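
The constant 25 in the final bound can be checked by hand. The following worked step is a sketch that plugs the choice $\nu = 2\sqrt{L\log(2LT)/T}$ from Algorithm 1 and the Cauchy-Schwarz bound $\sum_i \sqrt{k_iT_i} \le \sqrt{3LT}$ into the right-hand side of Eq. (3):

```latex
\begin{align*}
10\sqrt{\log(2LT)} \cdot \sqrt{3LT} &= 10\sqrt{3}\,\sqrt{LT\log(2LT)} , \\
\frac{4L\log(2LT)}{\nu} &= 2\sqrt{LT\log(2LT)} , \\
\nu T &= 2\sqrt{LT\log(2LT)} ,
\end{align*}
```

so the total is $(10\sqrt{3} + 4)\sqrt{LT\log(2LT)} \approx 21.3\sqrt{LT\log(2LT)} \le 25\sqrt{LT\log(2LT)}$.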


3.2 Main Algorithm 

We are now ready to present our main online algorithm: an algorithm for online learning with
optimizable experts, that guarantees $\epsilon$ expected average regret in total $\tilde{O}(\sqrt{N}/\epsilon^2)$ time. The algorithm is
presented in Algorithm 2, and in the following theorem we give its guarantees.

Theorem 1 (restated). The expected average regret of Algorithm 2 on any sequence of $T$ loss
functions $f_1, \ldots, f_T : [N] \to [0,1]$ over $N$ experts is upper bounded by $40N^{1/4}\log(NT)/\sqrt{T}$. The
algorithm can be implemented to run in $\tilde{O}(1)$ time per round in the optimizable experts model.

Algorithm 2 relies on the Leaders and MW$^1$ algorithms discussed earlier, and on yet another
variant of the MW method, the MW$^2$ algorithm, which is similar to MW$^1$. The difference
between the two algorithms is in their running time per round: MW$^1$, like standard MW, runs in
$O(N)$ time per round over $N$ experts; MW$^2$ is an “amortized” version of MW$^1$ that spreads
computation over time and runs in only $\tilde{O}(1)$ time per round, but requires $N$ times more rounds to
converge to the same average regret.

Parameters: $N$, $T$

1. Set $\eta = 2/(N^{1/4}\sqrt{T})$ and $\nu = 2\sqrt{\log(2T)/T}$

2. Sample a set $R$ of $\lfloor 2\sqrt{N}\log T \rfloor$ experts uniformly at random with replacement

3. Initialize an instance $\mathcal{A}_1$ of MW$^2(\eta, 1/T)$ on the experts in $R$

4. Initialize an instance $\mathcal{A}_2$ of Leaders$(L, T)$ with $L = \lfloor \sqrt{N} \rfloor$

5. Initialize an instance $\mathcal{A}$ of the MW$^1(\nu, 1/T)$ algorithm on $\mathcal{A}_1$ and $\mathcal{A}_2$ as experts

6. For $t = 1, 2, \ldots, T$:

(a) Play the prediction $x_t$ of the algorithm chosen by $\mathcal{A}$

(b) Observe $f_t$ and the new leader $x^*_t$, and use them to update $\mathcal{A}_1$, $\mathcal{A}_2$ and $\mathcal{A}$

Algorithm 2: Algorithm for online learning with an optimization oracle.


Lemma 12. Suppose that MW$^2$ (see Algorithm 4) is used for prediction with expert advice, against
an arbitrary sequence of loss functions $f_1, f_2, \ldots : [N] \to [0,1]$ over $N$ experts. Then, for $\gamma = 1/T$
and any $\eta > 0$, its sequence of predictions $x_1, x_2, \ldots$ satisfies

\[
\mathbb{E}\left[ \sum_{t=t_0}^{t_1} f_t(x_t) \right] - \min_{x \in [N]} \sum_{t=t_0}^{t_1} f_t(x)
\le \frac{4\log(NT)}{\eta} + \eta N T
\]

in any time interval $\{t_0, \ldots, t_1\}$ of length at most $T$. The algorithm can be implemented to run in
$O(\log N)$ time per round.


Given the Leaders algorithm, the overall idea behind Algorithm 2 is quite simple: first
guess $\sqrt{N}$ experts uniformly at random, so that with high probability one of the “top” $\sqrt{N}$ experts
is picked, where experts are ranked according to the last round of the game in which they are
leaders. (In particular, the best expert in hindsight is ranked first.) The first online algorithm
$\mathcal{A}_1$, an instance of MW$^2$, is designed to compete with this leader, up to the point in time where
it appears as leader for the last time. At this point, the second algorithm $\mathcal{A}_2$, an instance of
Leaders, comes into action and controls the regret until the end of the game. It is able to do
so because in that time period there are only few different leaders (i.e., at most $\sqrt{N}$), and as we
pointed out earlier, Leaders is designed to exploit this fact. The role of the algorithm $\mathcal{A}$, being
executed on top of $\mathcal{A}_1$ and $\mathcal{A}_2$ as experts, is to combine the two regret guarantees, each in
its relevant time interval.

Using Theorem 6 and Lemmas 11 and 12, we can formalize the intuitive idea sketched above 
and prove the main result of this section. 

Proof of Theorem 1. The fact that the algorithm can be implemented to run in $\tilde{O}(1)$ time per
round follows immediately from the running time of the algorithms MW$^1$, MW$^2$, and Leaders,
each of which runs in $\tilde{O}(1)$ time per round with the parameters used in Algorithm 2.

We move on to analyze the expected regret. Rank each expert $x \in [N]$ according to $\mathrm{rank}(x) = 0$
if $x$ is never a leader throughout the game, and $\mathrm{rank}(x) = \max\{t : x^*_t = x\}$ otherwise. Let
$x_{(1)}, \ldots, x_{(N)}$ be the list of experts sorted according to their rank in decreasing order (with ties
broken arbitrarily). In words, $x_{(1)}$ is the best expert in hindsight, $x_{(2)}$ is the expert leading right
before $x_{(1)}$ becomes the sole leader, $x_{(3)}$ is the leading expert right before $x_{(1)}$ and $x_{(2)}$ become the
only leaders, and so on. Using this definition, we define $X^* = \{x_{(1)}, \ldots, x_{(n)}\}$ to be the set of the top
$n = \lfloor \sqrt{N} \rfloor$ experts having the highest rank.
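
The ranking just defined is easy to compute from a leader sequence. The sketch below is illustrative only: it assumes experts are numbered $1, \ldots, N$, that the leader sequence is available as a list, and it uses a hypothetical function name `top_ranked_experts`.

```python
import math

def top_ranked_experts(leaders, num_experts):
    """Return (top, rank): rank[x] is the last round in which x leads (0 if never),
    and top lists the floor(sqrt(N)) experts of highest rank, best first."""
    rank = {x: 0 for x in range(1, num_experts + 1)}
    for t, x in enumerate(leaders, start=1):
        rank[x] = t                           # later rounds overwrite earlier ones
    n = math.isqrt(num_experts)               # n = floor(sqrt(N))
    ordered = sorted(rank, key=lambda x: rank[x], reverse=True)
    return ordered[:n], rank

top, rank = top_ranked_experts([3, 3, 1, 2, 1, 4, 4], num_experts=9)
print(top)
```

In this toy run expert 4 leads last, so it is the best expert in hindsight and is ranked first.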

First, consider the random set $R$. We claim that with high probability, this set contains at least
one of the top $n$ leaders. Indeed, we have

\[
\mathbb{P}(R \cap X^* = \emptyset)
= \left( 1 - \frac{n}{N} \right)^{\lfloor 2\sqrt{N}\log T \rfloor}
\le \left( 1 - \frac{1}{2\sqrt{N}} \right)^{\sqrt{N}\log T}
\le e^{-\frac{1}{2}\log T}
= \frac{1}{\sqrt{T}} ,
\]

so that with probability at least $1 - 1/\sqrt{T}$ it holds that $R \cap X^* \ne \emptyset$. As a result, it is enough to upper
bound the expected regret of the algorithm for any fixed realization of $R$ such that $R \cap X^* \ne \emptyset$; in
the event that the intersection is empty, which occurs with probability at most $1/\sqrt{T}$, the regret can be at
most $T$ and thus ignoring these realizations can only affect the expected regret by an additive $\sqrt{T}$
term. Hence, in what follows we fix an arbitrary realization of the set $R$ such that $R \cap X^* \ne \emptyset$ and
bound the expected regret of the algorithm.
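
The probability estimate above can be sanity-checked numerically. The snippet below is a small sketch (the function name `miss_probability` is hypothetical) that evaluates the chance that $\lfloor 2\sqrt{N}\log T \rfloor$ uniform draws with replacement all miss a fixed set of $n = \lfloor \sqrt{N} \rfloor$ experts, and compares it with $1/\sqrt{T}$.

```python
import math

def miss_probability(N, T):
    """Probability that every draw misses a fixed set of floor(sqrt(N)) experts."""
    n = math.isqrt(N)
    draws = math.floor(2 * math.sqrt(N) * math.log(T))
    return (1 - n / N) ** draws

for N, T in [(100, 10), (10_000, 1_000)]:
    print(N, T, miss_probability(N, T), 1 / math.sqrt(T))
```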

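Given $R$ with $R \cap X^* \ne \emptyset$, we can pick $x \in R \cap X^*$ and let $T_0 = \mathrm{rank}(x)$ be the last round in
which $x$ is the leader, so that $x = x^*_{T_0}$. Since $x \in R$ and $|R| \le 2\sqrt{N}\log T$, the MW$^2$ instance $\mathcal{A}_1$ over the experts
in $R$, with parameter $\eta = 2/(N^{1/4}\sqrt{T})$, guarantees (recall Lemma 12) that

\[
\mathbb{E}\left[ \sum_{t=1}^{T_0} f_t(x_t^{(1)}) \right] - \sum_{t=1}^{T_0} f_t(x^*_{T_0})
\le \frac{4\log(|R|T)}{\eta} + \eta |R| T
\le 8N^{1/4}\sqrt{T}\log(2NT) , \tag{4}
\]

where we use $x_t^{(1)}$ to denote the decision of $\mathcal{A}_1$ on round $t$.

On the other hand, observe that there are at most $n$ different leaders throughout the time
interval $\{T_0 + 1, \ldots, T\}$, which follows from the fact that $x \in X^*$. Thus, in light of Theorem 6, we
have

\[
\mathbb{E}\left[ \sum_{t=T_0+1}^{T} f_t(x_t^{(2)}) \right]
- \left( \sum_{t=1}^{T} f_t(x^*_T) - \sum_{t=1}^{T_0} f_t(x^*_{T_0}) \right)
\le 25N^{1/4}\sqrt{T\log(2NT)} , \tag{5}
\]

where here $x_t^{(2)}$ denotes the decision of $\mathcal{A}_2$ on round $t$.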

Now, since Algorithm 2 is playing MW$^1$ on $\mathcal{A}_1$ and $\mathcal{A}_2$ as experts with parameter $\nu =
2\sqrt{\log(2T)/T}$, Lemma 11 shows that

\[
\mathbb{E}\left[ \sum_{t=1}^{T_0} f_t(x_t) - \sum_{t=1}^{T_0} f_t(x_t^{(1)}) \right]
\le \frac{2\log(2T)}{\nu} + \nu T = 3\sqrt{T\log(2T)} , \tag{6}
\]

and similarly,

\[
\mathbb{E}\left[ \sum_{t=T_0+1}^{T} f_t(x_t) - \sum_{t=T_0+1}^{T} f_t(x_t^{(2)}) \right]
\le 3\sqrt{T\log(2T)} . \tag{7}
\]

Summing up Eqs. (4) to (7) we obtain the regret bound

\[
\mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) \right] - \sum_{t=1}^{T} f_t(x^*_T)
\le 39N^{1/4}\sqrt{T}\log(NT) \tag{8}
\]

for any fixed realization of $R$ with $R \cap X^* \ne \emptyset$. As we explained before, the overall expected regret
is larger by at most $\sqrt{T}$ than the right-hand side of Eq. (8), and dividing through by $T$ gives the
theorem. □

3.3 Multiplicative Weights Algorithms

We end the section by presenting the several variants of the Multiplicative Weights (MW) method
used in our algorithms above. For an extensive survey of the basic MW method and its applications,
refer to Arora et al. (2012).

3.3.1 MW$^1$: Mixed MW

The first variant, the MW$^1$ algorithm, is designed so that its regret on any time interval of bounded
length is controlled. The standard MW algorithm does not have such a property, because the weight 
it assigns to an expert might become very small if this expert performs badly, so that even if the 
expert starts making good decisions, it cannot regain a non-negligible weight. 

Our modification of the algorithm (see Algorithm 3) involves mixing in a fixed weight to the 
update of the algorithm, for all experts on each round, so as to keep the weights away from zero at 
all times. We note that this is not equivalent to the more standard modification of mixing-in the 
uniform distribution to the sampling distributions of the algorithms: in our variant, it is essential 
that the mixed weights are fed back into the update of the algorithm so as to control its weights. 


Parameters: $\eta$, $\gamma$

1. Initialize $w_1(x) = 1$ for all $x \in [N]$

2. For $t = 1, 2, \ldots$:

(a) For all $x \in [N]$ compute $q_t(x) = w_t(x)/W_t$ with $W_t = \sum_y w_t(y)$

(b) Pick $x_t \sim q_t$, play $x_t$ and receive feedback $f_t$

(c) For all $x \in [N]$, update $w_{t+1}(x) = w_t(x)e^{-\eta f_t(x)} + \frac{\gamma}{N}W_t$

Algorithm 3: The MW$^1$ algorithm.
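
To make the effect of the mixed-in weight concrete, here is a minimal sketch of the update in Algorithm 3, assuming the rule $w_{t+1}(x) = w_t(x)e^{-\eta f_t(x)} + \frac{\gamma}{N}W_t$. The class name `MixedMW` and its method names are hypothetical, and experts are indexed from 0 for convenience.

```python
import math
import random

class MixedMW:
    """Exponential weights with a fixed fraction of the total weight mixed into every update."""

    def __init__(self, num_experts, eta, gamma):
        self.n = num_experts
        self.eta = eta
        self.gamma = gamma
        self.w = [1.0] * num_experts

    def distribution(self):
        total = sum(self.w)
        return [wx / total for wx in self.w]

    def sample(self, rng):
        return rng.choices(range(self.n), weights=self.w, k=1)[0]

    def update(self, losses):
        total = sum(self.w)                        # W_t, taken before the update
        mix = self.gamma * total / self.n          # the mixed-in weight (gamma / N) W_t
        self.w = [wx * math.exp(-self.eta * fx) + mix
                  for wx, fx in zip(self.w, losses)]

alg = MixedMW(num_experts=2, eta=0.5, gamma=0.01)
for _ in range(200):
    alg.update([1.0, 0.0])                         # expert 0 performs badly for a long time
print("after bad phase:", alg.distribution()[0])
for _ in range(100):
    alg.update([0.0, 1.0])                         # then expert 0 becomes the good one
print("after recovery:", alg.distribution()[0])
```

Because the mixed-in term keeps every probability above roughly $\gamma/N$, the weight of expert 0 recovers within a few dozen rounds instead of staying exponentially small.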


In the following lemma we prove a regret bound for the MW$^1$ algorithm. We prove a slightly
more general result than the one we stated earlier in the section, which will become useful for the
analysis of the subsequent algorithms.

Lemma 11. For any sequence of loss functions $f_1, f_2, \ldots : [N] \to \mathbb{R}_+$ and for $\gamma = 1/T$ and any
$\eta > 0$, Algorithm 3 guarantees

\[
\mathbb{E}\left[ \sum_{t=t_0}^{t_1} f_t(x_t) \right] - \min_{x \in [N]} \sum_{t=t_0}^{t_1} f_t(x)
\le \frac{2\log(NT)}{\eta} + \eta\, \mathbb{E}\left[ \sum_{t=t_0}^{t_1} \sum_{x \in [N]} q_t(x) f_t(x)^2 \right]
\]

in any time interval $\{t_0, \ldots, t_1\}$ of length at most $T$. The algorithm can be implemented to run in
$O(N)$ time per round.

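Proof. The claim regarding the runtime of the algorithm is trivial, as all the computations on a
certain round can be completed in a single pass over the $N$ actions. Thus, we move on to analyze
the regret; the proof follows the standard analysis of exponential weighting schemes. We first write:

\begin{align*}
\frac{W_{t+1}}{W_t}
&= \sum_{x \in [N]} \frac{w_{t+1}(x)}{W_t} \\
&= \sum_{x \in [N]} \left( q_t(x) e^{-\eta f_t(x)} + \frac{\gamma}{N} \right) \\
&= \gamma + \sum_{x \in [N]} q_t(x) e^{-\eta f_t(x)} \\
&\le \gamma + \sum_{x \in [N]} q_t(x) \left( 1 - \eta f_t(x) + \eta^2 f_t(x)^2 \right)
   && \text{(using $e^z \le 1 + z + z^2$ for $z \le 1$)} \\
&\le 1 + \gamma - \eta \sum_{x \in [N]} q_t(x) f_t(x) + \eta^2 \sum_{x \in [N]} q_t(x) f_t(x)^2 .
\end{align*}

Taking logarithms, using $\log(1 + z) \le z$ for all $z > -1$, and summing over $t = t_0, \ldots, t_1$ yields

\[
\log \frac{W_{t_1+1}}{W_{t_0}}
\le \gamma T - \eta \sum_{t=t_0}^{t_1} \sum_{x \in [N]} q_t(x) f_t(x) + \eta^2 \sum_{t=t_0}^{t_1} \sum_{x \in [N]} q_t(x) f_t(x)^2 .
\]

Moreover, since for all $t$ and $x$ we have $w_{t+1}(x) \ge w_t(x)\exp(-\eta f_t(x))$, for any fixed action $x^*$ we
also have

\[
w_{t_1+1}(x^*) \ge w_{t_0+1}(x^*) \exp\left( -\eta \sum_{t=t_0+1}^{t_1} f_t(x^*) \right)
\ge w_{t_0+1}(x^*) \exp\left( -\eta \sum_{t=t_0}^{t_1} f_t(x^*) \right) ,
\]

and since $W_{t_1+1} \ge w_{t_1+1}(x^*)$ and $w_{t_0+1}(x^*) \ge (\gamma/N) W_{t_0}$, we obtain

\[
\log \frac{W_{t_1+1}}{W_{t_0}} \ge -\eta \sum_{t=t_0}^{t_1} f_t(x^*) - \log \frac{N}{\gamma} .
\]

Putting together and rearranging gives

\[
\sum_{t=t_0}^{t_1} \sum_{x \in [N]} q_t(x) f_t(x) - \sum_{t=t_0}^{t_1} f_t(x^*)
\le \frac{1}{\eta} \left( \gamma T + \log \frac{N}{\gamma} \right) + \eta \sum_{t=t_0}^{t_1} \sum_{x \in [N]} q_t(x) f_t(x)^2 .
\]

Finally, taking expectations and setting $\gamma = 1/T$ yields the result. □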


3.3.2 MW$^2$: Amortized Mixed MW

We now give an amortized version of the MW$^1$ algorithm. Specifically, we give a variant of the
latter algorithm that runs in $\tilde{O}(1)$ time per round and attains an $\tilde{O}(\sqrt{NT})$ bound over the expected
regret, as opposed to the MW$^1$ algorithm that runs in time $O(N)$ per round and achieves $\tilde{O}(\sqrt{T})$
regret. The algorithm, which we call MW$^2$, is based on the MW$^1$ update rule and incorporates
sampling for accelerating the updates.*

* This technique is reminiscent of bandit algorithms; however, notice that here we separate exploration and
exploitation: we sample two experts on each round, instead of one as required in the bandit setting.

Parameters: $\eta$, $\gamma$

1. Initialize $w_1(x) = 1$ for all $x \in [N]$

2. For $t = 1, 2, \ldots$:

(a) For all $x \in [N]$ compute $q_t(x) = w_t(x)/W_t$ with $W_t = \sum_y w_t(y)$

(b) Pick $x_t \sim q_t$, play $x_t$ and receive feedback $f_t$

(c) Pick $y_t \in [N]$ uniformly at random, and for all $x \in [N]$ update:

\[
w_{t+1}(x) =
\begin{cases}
w_t(x) e^{-\eta N f_t(x)} + \frac{\gamma}{N} W_t & \text{if } x = y_t \\
w_t(x) + \frac{\gamma}{N} W_t & \text{otherwise}
\end{cases}
\]

Algorithm 4: The MW$^2$ algorithm.

For Algorithm 4 we prove:


Lemma 12. For any sequence of loss functions $f_1, f_2, \ldots : [N] \to [0,1]$, and for $\gamma = 1/T$ and any
$\eta > 0$, Algorithm 4 guarantees that

\[
\mathbb{E}\left[ \sum_{t=t_0}^{t_1} f_t(x_t) \right] - \min_{x \in [N]} \sum_{t=t_0}^{t_1} f_t(x)
\le \frac{4\log(NT)}{\eta} + \eta N T
\]

in any time interval $\{t_0, \ldots, t_1\}$ of length at most $T$. Furthermore, the algorithm can be implemented
to run in $O(\log N)$ time per round.

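Proof. We first derive the claimed regret bound as a simple consequence of Lemma 11. Define a
sequence of loss functions $\tilde{f}_1, \ldots, \tilde{f}_T$, as follows:

\[
\forall x \in [N] , \qquad \tilde{f}_t(x) = N f_t(y_t) \cdot \mathbb{1}\{x = y_t\} .
\]

Notice that Algorithm 4 is essentially applying MW$^1$ updates to the loss functions $\tilde{f}_1, \ldots, \tilde{f}_T$
instead of to the original ones. Thus, we obtain from Lemma 11 that

\[
\mathbb{E}\left[ \sum_{t=t_0}^{t_1} \tilde{f}_t(x_t) \right] - \mathbb{E}\left[ \sum_{t=t_0}^{t_1} \tilde{f}_t(x^*) \right]
\le \frac{4\log(NT)}{\eta} + \eta\, \mathbb{E}\left[ \sum_{t=t_0}^{t_1} \sum_{x \in [N]} q_t(x) \tilde{f}_t(x)^2 \right]
\]

for any fixed $x^* \in [N]$. We now get the lemma by observing that $\mathbb{E}[\tilde{f}_t(x)] = f_t(x)$ and $\mathbb{E}[\tilde{f}_t(x)^2] =
N f_t(x)^2 \le N$ for all $t$ and $x$ (also notice that $x_t$ is independent of $\tilde{f}_t$).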
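
The two moment identities used in this step can be verified exactly on a small example. The sketch below assumes the sampled loss takes the form $N f_t(y_t) \cdot \mathbb{1}\{x = y_t\}$ with $y_t$ uniform over the $N$ experts; the function name `estimator_moments` is hypothetical, and exact rational arithmetic is used so that the equalities hold without rounding.

```python
from fractions import Fraction

def estimator_moments(f, x):
    """Exact first and second moments of N * f[y] * 1{x == y} for y uniform over experts."""
    N = len(f)
    mean = Fraction(0)
    second = Fraction(0)
    for y in range(N):                         # each y occurs with probability 1/N
        est = N * f[y] if x == y else Fraction(0)
        mean += est / N
        second += est * est / N
    return mean, second

f = [Fraction(1, 4), Fraction(1, 2), Fraction(1)]
for x in range(len(f)):
    print(x, estimator_moments(f, x))
```

The first moment equals the true loss, while the second moment is inflated by a factor of $N$, which is where the extra $N$ in the regret bound comes from.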

It remains to prove that the algorithm’s updates can be carried out in time $O(\log N)$ per round.
The weights $w_t(x)$ can be maintained implicitly as a sum of variables $w_t(x) = \alpha_t(x) + \beta_t$, where
$\beta_t$ captures the amount of uniform distribution for the $t$'th weights. The main observation is that
$w_t(x)$ can now be updated via:

\[
\alpha_{t+1}(x) = \left( \alpha_t(x) + \beta_t \right) e^{-\eta N f_t(y_t) \mathbb{1}\{x = y_t\}} - \beta_t ,
\]
\[
\beta_{t+1} = \beta_t + \frac{\gamma}{N} \left( \sum_{x \in [N]} \alpha_t(x) + N\beta_t \right) .
\]

Notice that the vector $\alpha_{t+1}$ has only one component that needs updating per iteration: the one
corresponding to $y_t$. The update of the scalar $\beta_{t+1}$ needs the sum of all parameters $\alpha_t(x)$, which can
be maintained efficiently alongside the individual weights.
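
The implicit representation can be checked against the explicit update on random data. The sketch below assumes the sampled update multiplies only the weight of the sampled expert by $e^{-\eta N f_t(y_t)}$ and adds $\frac{\gamma}{N}W_t$ to every weight; the function name `run_explicit_and_implicit` is hypothetical. It runs both versions side by side and returns the two weight vectors.

```python
import math
import random

def run_explicit_and_implicit(losses, eta, gamma, seed=0):
    rng = random.Random(seed)
    N = len(losses[0])
    w = [1.0] * N            # explicit weights
    alpha = [1.0] * N        # per-expert part of the implicit representation
    beta = 0.0               # part shared by all experts
    alpha_sum = float(N)     # running sum of the per-expert parts
    for f in losses:
        y = rng.randrange(N)
        decay = math.exp(-eta * N * f[y])
        # Explicit update: touches all N weights.
        W = sum(w)
        w = [(wx * decay if x == y else wx) + gamma * W / N for x, wx in enumerate(w)]
        # Implicit update: touches alpha[y], beta and alpha_sum only.
        W_implicit = alpha_sum + N * beta
        new_alpha_y = (alpha[y] + beta) * decay - beta
        alpha_sum += new_alpha_y - alpha[y]
        alpha[y] = new_alpha_y
        beta += gamma * W_implicit / N
    return w, [a + beta for a in alpha]

data_rng = random.Random(1)
losses = [[data_rng.random() for _ in range(5)] for _ in range(200)]
explicit, implicit = run_explicit_and_implicit(losses, eta=0.05, gamma=0.01)
print(max(abs(a - b) for a, b in zip(explicit, implicit)))
```

Only a constant number of scalars change per round in the implicit version, which is the source of the savings; the remaining cost comes from sampling.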

Finally, we explain how to sample efficiently from $q_t$. We can write

\[
q_t(x) = (1 - \mu_t) \cdot \frac{\alpha_t(x)}{\sum_y \alpha_t(y)} + \mu_t \cdot \frac{1}{N}
\]

with $\mu_t = N\beta_t / (N\beta_t + \sum_y \alpha_t(y))$. Thus, sampling from $q_t$ can be carried out in two stages: with
probability $\mu_t$ sample uniformly at random; with the remaining probability, sample according to
the weights $\alpha_t(x)$. In order to implement the latter sampling operation in time $O(\log N)$, we can
maintain a binary tree with $N$ leaves that correspond to the weights $\alpha_t(x)$, and with the internal
nodes caching the total weight of their descendants. □
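
A binary tree of cached subtree sums is a standard structure, and one possible array-based version is sketched below. The class name `SumTree` and its methods are hypothetical, and the sketch assumes all leaf weights are non-negative. Both changing one leaf and drawing a sample walk a single root-to-leaf path, so each costs $O(\log N)$.

```python
import random

class SumTree:
    """Complete binary tree in an array; every internal node caches the sum of its subtree."""

    def __init__(self, weights):
        self.n = len(weights)
        self.size = 1
        while self.size < self.n:
            self.size *= 2                     # pad the leaf level to a power of two
        self.tree = [0.0] * (2 * self.size)
        for i, w in enumerate(weights):
            self.tree[self.size + i] = w
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def total(self):
        return self.tree[1]

    def update(self, i, weight):
        node = self.size + i
        self.tree[node] = weight
        node //= 2
        while node >= 1:                       # refresh the cached sums up to the root
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def find(self, u):
        """Return the leaf whose cumulative-weight interval contains u, for 0 <= u < total."""
        node = 1
        while node < self.size:
            left = self.tree[2 * node]
            if u < left:
                node = 2 * node
            else:
                u -= left
                node = 2 * node + 1
        return node - self.size

    def sample(self, rng):
        return self.find(rng.random() * self.total())

tree = SumTree([1.0, 2.0, 3.0])
print(tree.sample(random.Random(0)))
```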


3.3.3 MW$^3$: Sliding Amortized Mixed MW

The final component we require is a version of MW$^2$ that works in an online learning setting with
activated experts. In this version, on each round of the game one of the experts is “activated”. The
sequence of activations is determined based only on the loss values and does not depend on past
decisions of the algorithm; thus, it can be thought of as set by the oblivious adversary before the
game starts. The goal of the player is to compete only with the last $k$ (distinct) activated experts,
for some parameter $k$. In the context of the present section, the expert activated on round $t$ is
the leader at the end of that round. Therefore, we overload notation and denote by $x^*_t$ the expert
activated on round $t$.


Parameters: $k$, $\eta$, $\gamma$

1. Initialize $B_1(i) = i$ and $w_1(i) = 1$ for all $i \in [k]$

2. For $t = 1, 2, \ldots$:

(a) For all $i \in [k]$ compute $q_t(i) = w_t(i)/W_t$ with $W_t = \sum_j w_t(j)$

(b) Pick $i_t \sim q_t$, play $x_t = B_t(i_t)$, receive $f_t$ and new activated expert $x^*_t$

(c) Update weights: pick $j_t \in [k]$ uniformly at random, and set:

\[
\forall i \in [k] , \qquad w_{t+1}(i) =
\begin{cases}
w_t(i) e^{-\eta k f_t(B_t(i))} + \frac{\gamma}{k} W_t & \text{if } i = j_t \\
w_t(i) + \frac{\gamma}{k} W_t & \text{otherwise}
\end{cases}
\]

(d) Update buffer: set $B_{t+1} = B_t$; if $x^*_t \notin B_t$, find the index $i' \in [k]$ of the oldest
activated expert in $B_t$ (break ties arbitrarily) and set $B_{t+1}(i') = x^*_t$.

Algorithm 5: The MW$^3$ algorithm.


The MW$^3$ algorithm, presented in Algorithm 5, is a “sliding window” version of MW$^2$ that
keeps a buffer of the last $k$ (distinct) activated experts. When its buffer gets full and a new expert
is activated, the algorithm evicts from the buffer the expert whose most recent activation is the
oldest. (Notice that the latter expert is not necessarily the oldest one in the buffer, as an expert
can be re-activated while already in the buffer.) In this case, the newly inserted expert is assigned
the same weight as the expert evicted from the buffer.
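
The eviction rule can be illustrated with a small sketch of the buffer alone (the weights are omitted). The class name `ActivationBuffer` is hypothetical, and for simplicity the sketch scans all $k$ slots on eviction rather than keeping them sorted by activation time. Since weights are indexed by slot, a newly inserted expert automatically inherits the weight of the expert it replaces.

```python
class ActivationBuffer:
    """Fixed-size buffer of k slots holding the most recently activated experts."""

    def __init__(self, k):
        self.slots = list(range(1, k + 1))         # slot i initially holds expert i + 1
        self.last_activation = [0] * k             # round of the latest activation per slot
        self.slot_of = {x: i for i, x in enumerate(self.slots)}

    def activate(self, expert, round_index):
        """Record an activation and return the slot occupied by `expert`."""
        if expert in self.slot_of:
            i = self.slot_of[expert]               # re-activation: the slot is unchanged
        else:
            # Evict the expert whose most recent activation is the oldest.
            i = min(range(len(self.slots)), key=lambda j: self.last_activation[j])
            del self.slot_of[self.slots[i]]
            self.slots[i] = expert
            self.slot_of[expert] = i
        self.last_activation[i] = round_index
        return i

buf = ActivationBuffer(2)
for t, x in enumerate([1, 2, 1, 7], start=1):
    buf.activate(x, t)
print(buf.slots)
```

In this run expert 1 is re-activated on round 3, so expert 2 is the one evicted when expert 7 arrives, even though expert 1 entered the buffer earlier.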

For the MW$^3$ algorithm we prove:


Lemma 13. Assume that expert $x^* \in [N]$ was activated on round $t_0$, and from that point until
round $t_1$ there were no more than $k$ different activated experts (including $x^*$ itself). Then, for any
time interval $[t_0', t_1'] \subseteq [t_0, t_1]$ of length at most $T$, the MW$^3$ algorithm with parameters $k$, $\gamma = 1/T$
and any $\eta > 0$ guarantees

\[
\mathbb{E}\left[ \sum_{t=t_0'+1}^{t_1'} f_t(x_t) \right] - \sum_{t=t_0'+1}^{t_1'} f_t(x^*)
\le \frac{4\log(kT)}{\eta} + \eta k T .
\]

Furthermore, the algorithm can be implemented to run in time $\tilde{O}(1)$ per round.

Proof. For each index $i = 1, \ldots, k$, imagine a “meta-expert”, whose loss on round $t$ of the game
is equal to the loss incurred on the same round by the expert occupying entry $i$ of the buffer $B_t$
on round $t$. Then, notice that we can think of the algorithm as following MW$^2$ updates over
these meta-experts. Importantly, the assignment of losses to the meta-experts is oblivious to the
predictions of the algorithm: indeed, the only factor affecting the addition/removal of experts
to/from the buffer is the pattern of their activation, which is being decided by the adversary before
the game begins.

Furthermore, as long as an expert is not evicted from the buffer during a certain period, it
occupies the same entry and thus is associated with the same meta-expert throughout that period.
In particular, throughout the period between rounds $t_0' + 1$ and $t_1'$ the expert $x^*$ is associated with
some fixed meta-expert: $x^*$ was activated on round $t_0 \le t_0'$ and was not removed from the buffer
by round $t_1'$, because there were at most $k$ different activated experts up to round $t_1 \ge t_1'$. Hence,
the first claim of the lemma follows directly from the regret bound of MW$^2$ stated in Lemma 12.

Finally, the claim regarding the runtime of the algorithm can be obtained by using standard
data structures for the maintenance of the buffer (and the associated weights). For example, $B_t$
can be maintained sorted according to last time of activation; for accommodating the membership
query $x^*_t \in B_t$ in $O(1)$ time, one can maintain an additional copy of $B_t$ represented by a set data
structure. □

4 Solving Zero-sum Games with Best-response Oracles 

In this section we apply our online algorithms to repeated game playing in zero-sum games with 
best-response oracles. Before we do that, we first have to extend our results to the case where 
the assignment of losses can be adaptive to the decisions of the algorithm. Namely, unlike in the 
standard model of an oblivious adversary where the loss functions are being determined before the 
game begins, in the adaptive setting the loss function $f_t$ on round $t$ may depend on the (possibly
randomized) decisions $x_1, \ldots, x_{t-1}$ chosen in previous rounds.

Fortunately, with minor modifications Algorithm 2 can be adapted to the non-oblivious setting 
and obtain low regret against adaptive adversaries as well. In a nutshell, we show that the algorithm 
can be made “self-oblivious”, in the sense that its decision on each round depends only indirectly 
on its previous decisions, through its dependence on the previous loss functions; algorithms with 
this property are ensured to work well against adaptive adversaries (see McMahan and Blum, 2004; 
Dani and Hayes, 2006; Cesa-Bianchi and Lugosi, 2006). Furthermore, the same property is also 
sufficient for the adapted algorithm to obtain low regret with high probability, and not only in 
expectation. The formal details are given in the proof of the following corollary. 

Corollary 14. With probability at least $1 - \delta$, the average regret of Algorithm 2 (when implemented
appropriately) is upper bounded by $40N^{1/4}\log(NT/\delta)/\sqrt{T}$. This is true even when the algorithm faces
a non-oblivious adversary.

Proof. We first explain how Algorithm 2 can be implemented so that the following “self-obliviousness”
property holds, for all rounds $t$:

\[
\mathbb{P}(x_t = x \mid x_1, \ldots, x_{t-1}, f_1, \ldots, f_{t-1}) = \mathbb{P}(x_t = x \mid f_1, \ldots, f_{t-1}) . \tag{9}
\]

We can ensure this holds by randomizing separately for making the decisions $x_t$, and for updating
the algorithm. That is, if $x_t$ is sampled from a distribution $p_t$ on round $t$ of the game, then
we discard $x_t$ and use a different independent sample $x_t' \sim p_t$ for updating the algorithm. In fact,
Algorithm 2 makes multiple updates on each round (to the various $\mathcal{A}_{r,s}$ algorithms, etc.); we ensure
that the sample $x_t$ is not used in any of these updates, and a fresh sample is picked when necessary.
Further, we note that this slight modification does not impact the runtime of the algorithm, up to
constants.

With Algorithm 2 implemented this way, we can now use, e.g., Lemma 4.1 of Cesa-Bianchi and
Lugosi (2006) to obtain from Theorem 1 that

\[
\frac{1}{T} \sum_{t=1}^{T} f_t(x_t) - \min_{x \in [N]} \frac{1}{T} \sum_{t=1}^{T} f_t(x)
\le \frac{40N^{1/4}\log(NT)}{\sqrt{T}} + \sqrt{\frac{\log(1/\delta)}{2T}}
\]

holds with probability at least $1 - \delta$ against any non-oblivious adversary. Further upper bounding
the right-hand side of the above yields the stated regret bound. □

Using standard techniques (Freund and Schapire, 1999), we can now use Algorithm 2 to solve
zero-sum games equipped with best-response oracles. The simple scheme is presented in
Algorithm 6: both players use the online Algorithm 2 to produce their decisions throughout the game,
and employ their best response oracles to compute the “leaders”. In the context of zero-sum games,
the leader on iteration $t$ is the best response to the empirical distribution of the past plays of the
opponent.


Parameters: game matrix $G \in [0,1]^{N \times N}$, parameter $T$

1. Initialize instances $\mathcal{A}_1$, $\mathcal{A}_2$ of Algorithm 2 with parameters $N$, $T$ for the row and
column players, respectively

2. For $t = 1, 2, \ldots, T$:

(a) Let the players play the decisions $x_t$, $y_t$ of $\mathcal{A}_1$, $\mathcal{A}_2$, respectively

(b) Let $p_t$ be the empirical distribution of $x_1, \ldots, x_t$, and let $q_t$ be the empirical
distribution of $y_1, \ldots, y_t$

(c) Update $\mathcal{A}_1$ with the loss function $G(\cdot, y_t)$ and the leader $x^*_t = \mathrm{BR}_1(q_t)$

(d) Update $\mathcal{A}_2$ with the loss function $G(x_t, \cdot)$ and the leader $y^*_t = \mathrm{BR}_2(p_t)$

Output: the profile $(\bar{p}, \bar{q}) = (p_T, q_T)$

Algorithm 6: Algorithm for zero-sum games with best-response oracles.

We remark that, in fact, it is sufficient that only one of the players follow Algorithm 2 for 
ensuring fast convergence to equilibrium. The other player could, for example, behave greedily 
and simply follow his best response to the plays of the first player—a strategy known as “fictitious 
play”. 

Finally, we analyze Algorithm 6, thereby proving Theorem 3. 

Theorem 3 (restated). With probability at least $1 - \delta$, Algorithm 6 with $T = \tilde{O}(\sqrt{N}/\epsilon^2)$
outputs a mixed strategy profile $(\bar{p}, \bar{q})$ being an $\epsilon$-approximate equilibrium. The algorithm can be
implemented to run in $\tilde{O}(\sqrt{N}/\epsilon^2)$ time.

Proof. First, we note the running time. Notice that the empirical distributions pt and qt do not 
have to be recomputed each round, and can be maintained incrementally with constant time per 
iteration. Furthermore, since we assume that each call to the best response oracles costs 0(1) 
time. Theorem 1 shows that the updates of Ai,A 2 can be implemented in time 0(1). Hence, each 
iteration costs 0(1) and so the overall runtime is 0{y/N /e^). 

We move on to analyze the output of the algorithm. Using the regret guarantee of Corollary 14
for the online algorithm of the row player, we have

$$\frac{1}{T} \sum_{t=1}^{T} G(x_t, y_t) - \min_{x \in [N]} \frac{1}{T} \sum_{t=1}^{T} G(x, y_t) \le \frac{40 N^{1/4}}{\sqrt{T}} \log\frac{2NT}{\delta}$$

with probability at least $1 - \frac{\delta}{2}$. Similarly, for the online algorithm of the column player we have

$$\max_{y \in [N]} \frac{1}{T} \sum_{t=1}^{T} G(x_t, y) - \frac{1}{T} \sum_{t=1}^{T} G(x_t, y_t) \le \frac{40 N^{1/4}}{\sqrt{T}} \log\frac{2NT}{\delta}$$

with probability at least $1 - \frac{\delta}{2}$. Hence, summing the two inequalities and using
$\frac{1}{T} \sum_{t=1}^{T} G(x_t, y) = G(p, y)$ and $\frac{1}{T} \sum_{t=1}^{T} G(x, y_t) = G(x, q)$, we have

$$\max_{q' \in \Delta_N} G(p, q') - \min_{p' \in \Delta_N} G(p', q) \le \frac{80 N^{1/4}}{\sqrt{T}} \log\frac{2NT}{\delta} \le \epsilon \qquad (10)$$

with probability at least $1 - \delta$; the last inequality follows from our choice of $T$ and a tedious
calculation.

Now, let $(p^*, q^*)$ be an equilibrium of the game, and denote by $\lambda^* = G(p^*, q^*)$ the value of the
game. As a result of Eq. (10), for all $q' \in \Delta_N$ we have

$$G(p, q') \le \min_{p' \in \Delta_N} G(p', q) + \epsilon \le G(p^*, q) + \epsilon \le G(p^*, q^*) + \epsilon = \lambda^* + \epsilon .$$

Similarly, for all $p' \in \Delta_N$,

$$G(p', q) \ge \max_{q' \in \Delta_N} G(p, q') - \epsilon \ge G(p, q^*) - \epsilon \ge G(p^*, q^*) - \epsilon = \lambda^* - \epsilon .$$

This means that, with probability at least $1 - \delta$, the mixed strategy profile $(p, q)$ is an $\epsilon$-approximate
equilibrium. □
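The quantity bounded in Eq. (10) is easy to evaluate for a small explicit game. The sketch below is illustrative only (the function names are hypothetical); it computes the gap between the best pure column reply to $p$ and the best pure row reply to $q$, with the row player minimizing and the column player maximizing as in the proof.

```python
def duality_gap(G, p, q):
    """Return max_j (p^T G)_j - min_i (G q)_i for a payoff matrix G.

    Both extrema over mixed strategies are attained at pure strategies,
    so scanning rows and columns suffices.
    """
    n_rows, n_cols = len(G), len(G[0])
    # Payoff of each pure column j against the row mixture p.
    col_payoffs = [sum(p[i] * G[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Payoff of each pure row i against the column mixture q.
    row_payoffs = [sum(G[i][j] * q[j] for j in range(n_cols)) for i in range(n_rows)]
    return max(col_payoffs) - min(row_payoffs)


def is_approx_equilibrium(G, p, q, eps):
    # A profile with gap at most eps satisfies Eq. (10) with that eps.
    return duality_gap(G, p, q) <= eps


# Matching pennies with payoffs in [0, 1]; the uniform profile is an equilibrium.
matching_pennies = [[1.0, 0.0], [0.0, 1.0]]
uniform = [0.5, 0.5]
gap_uniform = duality_gap(matching_pennies, uniform, uniform)
gap_pure = duality_gap(matching_pennies, [1.0, 0.0], [1.0, 0.0])
```

This scan costs time proportional to the matrix size, so it is only a checking tool for small examples and not something the sublinear-time algorithm could afford.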


5 Lower Bounds 

In this section we prove our computational lower bounds for optimizable experts and learning in 
games, stated in Theorems 2 and 4. We begin with the latter, and then prove Theorem 2 via 
reduction. 

Theorem 4 (restated). For any efficient (randomized) players in a repeated N x N zero-sum game 
with best-response oracles, there exists a game matrix with [0,1] values such that with probability 
at least the average payoff of the players is at least | far from equilibrium after $\Omega(\sqrt{N}/\log^3 N)$
time.

We prove Theorem 4 by means of a reduction from a local-search problem studied by Aldous 
(1983), which we now describe. 

Lower bounds for local search. Consider a function $f : \{0,1\}^d \to \mathbb{N}$ over the $d$-dimensional
hypercube. A local optimum of $f$ over the hypercube is a vertex such that the value of $f$ at this
vertex is larger than or equal to the values of $f$ at all neighboring vertices (i.e., those with Hamming
distance one). A function is said to be globally-consistent if it has a single local optimum (or in
other words, if every local optimum is also a global optimum of the function).

Aldous (1983) considered the following problem (slightly rephrased here for convenience), which
we refer to as Aldous’ problem: Given a globally-consistent function $f : \{0,1\}^d \to \mathbb{N}$ (given as a
black-box oracle), determine whether the maximal value of $f$ is even or odd with a minimal number
of queries to the function.

The following theorem is an improvement of Aaronson (2006) to a result initially proved by 
Aldous (1983). 

Theorem 15 (Aldous, 1983; Aaronson, 2006). For any randomized algorithm for Aldous’ problem
that makes no more than $\Omega(2^{d/2}/d^2)$ value queries in the worst case, there exists a function
$f : \{0,1\}^d \to \mathbb{N}$ such that the algorithm cannot determine with probability higher than | whether the
maximal value of $f$ over $\{0,1\}^d$ is even or odd.

We reduce Aldous’ problem to the problem of approximately solving a large zero-sum game 
with best-response and value oracles. Our reduction proceeds by constructing a specific form of a 
game, which we now describe. 


The reduction. Let $f : [N] \to \mathbb{N}$ be an input to Aldous’ problem, with maximal value
$f^* = \max_{i \in [N]} f(i)$. (Here, we identify each vertex of the $\lceil \log_2 N \rceil$-dimensional hypercube with a natural
number in the range $1, \ldots, N$ corresponding to its binary representation.) We shall describe a
zero-sum game with value $\lambda = \lambda(f^*)$, where


V fc E N , A(fc) 


2 if A: is even, 
I if A: is odd. 


Henceforth, we use $\Gamma(V)$ to denote the set of neighbors of a set of vertices $V \subseteq [N]$ of the hypercube
(that includes the vertices in $V$ themselves).

The game matrix and its corresponding oracles are constructed as follows:

• Game matrix: Based on the function $f$, define a matrix $G^f \in [0,1]^{N \times N}$ as follows:

$$\forall\, i,j \in [N], \qquad G^f_{i,j} = \begin{cases} \lambda(f(i)) & \text{if $i$ and $j$ are local maxima of $f$,} \\ 0 & \text{if } f(i) > f(j), \\ 1 & \text{otherwise.} \end{cases}$$

• Value oracle: The oracle $\mathrm{Val}(i,j)$ simply returns $G^f_{i,j}$ for any $i,j \in [N]$ given as input.

• Best-response oracles: For any mixed strategy $p \in \Delta_N$, define:

$$\mathrm{BR}_1(p) = \mathrm{BR}_2(p) = \arg\max_{i \in \Gamma(\mathrm{supp}(p))} f(i) .$$

We now set out to prove Theorem 4. First, we assert that the value of the described game indeed
equals $\lambda$.

Lemma 16. The minimax value of the game described by the matrix $G^f$ equals $\lambda$.

Proof. Let $i^* = \arg\max_{i \in [N]} f(i)$ be the global maximum of $f$. We claim that the (pure) strategy
profile $(i^*, i^*)$ is an equilibrium of the game given by $G^f$. To see this, notice that the payoff with
this profile is $\lambda(f(i^*)) = \lambda$. However, any profile of the form $(i, i^*)$ generates a payoff of either
$G^f_{i,i^*} = \lambda$ or $G^f_{i,i^*} = 1$, since $f(i) \le f(i^*)$ for each $i$. Hence, the row player does not benefit by
deviating from playing $i^*$. Similarly, a profile of the form $(i^*, j)$ yields a payoff of at most $\lambda$, thus
a deviation from $i^*$ is not profitable to the column player either. □
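The construction and the equilibrium claim of Lemma 16 can be checked on a toy instance. The sketch below is an illustration under stated assumptions: vertices are 0-indexed integers whose bits are the hypercube coordinates, the helper names are hypothetical, and the two values of $\lambda$ are passed in as parameters (the numbers 0.25 and 0.75 in the usage lines are illustrative choices, not taken from the construction).

```python
def neighbors(v, d):
    """Vertices of the d-cube at Hamming distance at most one from v (v included)."""
    return [v] + [v ^ (1 << b) for b in range(d)]


def build_game(f, d, lam_even, lam_odd):
    """Build the N x N matrix (N = 2**d) from a table f of natural numbers."""
    n = 1 << d
    local_max = [all(f[v] >= f[u] for u in neighbors(v, d)) for v in range(n)]
    G = [[1.0] * n for _ in range(n)]          # the "otherwise" case
    for i in range(n):
        for j in range(n):
            if local_max[i] and local_max[j]:
                # Value depends on the parity of f at the local maximum.
                G[i][j] = lam_even if f[i] % 2 == 0 else lam_odd
            elif f[i] > f[j]:
                G[i][j] = 0.0
    return G


d = 3
# Hamming weight has a single local maximum (the all-ones vertex), so it is
# globally consistent in the sense used in the text.
f = [bin(v).count("1") for v in range(1 << d)]
G = build_game(f, d, lam_even=0.25, lam_odd=0.75)
i_star = max(range(1 << d), key=lambda v: f[v])

# Saddle-point conditions: the minimizing row player and the maximizing
# column player cannot improve on the payoff at (i_star, i_star).
row_ok = all(G[i][i_star] >= G[i_star][i_star] for i in range(1 << d))
col_ok = all(G[i_star][j] <= G[i_star][i_star] for j in range(1 << d))
```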

Next, we show that the value and best-response oracles we specified are correct. 

Lemma 17. The procedures Val and $\mathrm{BR}_1$, $\mathrm{BR}_2$ are correct value and best-response oracles for the
game $G^f$.

Proof. The oracle Val is trivially a valid value oracle for the described game, as $\mathrm{Val}(i,j) = G^f_{i,j}$ for
all $i,j \in [N]$ by definition. Moving on to the best-response oracles, we shall prove the claim for the
oracle $\mathrm{BR}_2$; the proof for $\mathrm{BR}_1$ is very similar and thus omitted.

Consider some input $p \in \Delta_N$ and denote by $j = \mathrm{BR}_2(p)$ the output of the oracle. Note that if
$i^* \in \mathrm{supp}(p)$, then $j = \arg\max\{f(i) : i \in \Gamma(\mathrm{supp}(p))\} = i^*$ as $f$ has a single global maximum at $i^*$.
The main observation in this case is that the strategy $j = i^*$ dominates all other column strategies
$j'$: indeed, it is not hard to verify that since $f(j) \ge f(j')$ for any $j'$, we have $G^f_{i,j} \ge G^f_{i,j'}$ for all $i, j'$.
This implies that $p^T G^f e_j \ge p^T G^f e_{j'}$ for all $j'$, namely, $j$ is a best-response to $p$.

On the other hand, if $i^* \notin \mathrm{supp}(p)$ we claim that $p^T G^f e_j = 1$, which would immediately give
$p^T G^f e_j \ge p^T G^f e_{j'}$ for all $j'$ (as the maximal payoff in the game is 1). This follows because if
$i^* \notin \mathrm{supp}(p)$, then it must hold that $f(j) \ge f(i)$ for all $i \in \mathrm{supp}(p)$, which means that $G^f_{i,j} = 1$ for
all $i \in \mathrm{supp}(p)$. Hence, $p^T G^f e_j = 1$ as claimed. □
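Both cases of the proof can be exercised numerically on a small cube. The following sketch makes the same assumptions as any toy check of this construction: 0-indexed vertices, hypothetical helper names, and illustrative values 0.25 and 0.75 for $\lambda$. It compares the oracle's answer with a brute-force scan over all columns.

```python
def neighbors(v, d):
    """Vertices of the d-cube at Hamming distance at most one from v (v included)."""
    return [v] + [v ^ (1 << b) for b in range(d)]


def build_game(f, d, lam_even, lam_odd):
    n = 1 << d
    local_max = [all(f[v] >= f[u] for u in neighbors(v, d)) for v in range(n)]
    G = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if local_max[i] and local_max[j]:
                G[i][j] = lam_even if f[i] % 2 == 0 else lam_odd
            elif f[i] > f[j]:
                G[i][j] = 0.0
    return G


def best_response(f, d, p):
    """Oracle from the construction: argmax of f over the neighborhood of supp(p)."""
    support = [i for i, w in enumerate(p) if w > 0]
    candidates = sorted({u for i in support for u in neighbors(i, d)})
    return max(candidates, key=lambda u: f[u])   # reads f only near supp(p)


def column_payoff(G, p, j):
    """Expected payoff p^T G e_j of the pure column j against the row mixture p."""
    return sum(p[i] * G[i][j] for i in range(len(G)))


d = 3
f = [bin(v).count("1") for v in range(1 << d)]   # single local maximum at vertex 7
G = build_game(f, d, lam_even=0.25, lam_odd=0.75)

p1 = [0.5, 0, 0, 0, 0, 0, 0, 0.5]   # support contains the global maximum
p2 = [0.5, 0.5, 0, 0, 0, 0, 0, 0]   # support misses the global maximum
j1 = best_response(f, d, p1)
j2 = best_response(f, d, p2)
```

In the second case the oracle's column attains payoff 1, the largest entry of the matrix, which is exactly the argument used in the proof.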

We can now prove our main theorem. 

Proof of Theorem 4. Let $f : [N] \to \mathbb{N}$ be an arbitrary globally-consistent function over the
hypercube. Consider some algorithm that computes the value of the zero-sum game $G^f$ up to an additive
error of j with probability at least |. Determining the game value up to | gives the value of $\lambda$
(that can equal either ^ or |), which by construction is determined according to the maximal value
of $f$ being even or odd. Thus, the number of queries to $f$ the algorithm makes, through one of the
oracles, is lower-bounded by $\Omega(\sqrt{N}/\log^2 N)$ according to Theorem 15. In what follows, we show
how this lower bound can be translated to a lower bound on the runtime of the algorithm.

First, notice that the runtime of the algorithm is lower bounded by the total number of
row/column indices touched by the algorithm, namely, the total number of indices that appear
in inputs to one of the oracles Val, Opt at some point throughout the execution of the algorithm
(we think of index $i$ as appearing in the input of the call $\mathrm{Opt}(p)$ if $i \in \mathrm{supp}(p)$). Hence, if we let $S$
denote the set of all indices touched by the algorithm, then it is enough to lower bound the size
of $S$ in order to obtain a lower bound on the runtime of the algorithm.

Now, notice that the set of all entries of $f$ queried by the algorithm throughout its execution, via
one of the oracles Val and Opt, is a subset of $\bigcup_{i \in S} \Gamma(i)$. Indeed, upon any index $i \in S$ that appears
in the input to one of the oracles, the function $f$ has to be queried only at the neighborhood $\Gamma(i)$ to
produce the required output (recall the definitions of the oracles in our construction of $G^f$ above).
Hence, the total number of distinct queries to $f$ made by the algorithm is at most $O(|S| \cdot \log N)$.
On the other hand, as we noted earlier, this number is lower bounded by $\Omega(\sqrt{N}/\log^2 N)$ as a result
of Theorem 15. Both bounds together yield the lower bound $|S| = \Omega(\sqrt{N}/\log^3 N)$, which directly
implies the desired runtime lower bound. □
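The counting step in the last paragraph, namely that touching $|S|$ indices exposes at most $|S| \cdot (\log_2 N + 1)$ entries of $f$, can be illustrated directly. The names below are hypothetical and vertices are 0-indexed integers.

```python
def neighbors(v, d):
    """Vertices of the d-cube at Hamming distance at most one from v (v included)."""
    return [v] + [v ^ (1 << b) for b in range(d)]


def queried_entries(touched, d):
    """Entries of f that may be read when the indices in `touched` appear in oracle inputs."""
    entries = set()
    for i in touched:
        entries.update(neighbors(i, d))
    return entries


d = 10                      # N = 2**10 vertices
S = {0, 1, 5, 1023}         # indices touched by some hypothetical run
Q = queried_entries(S, d)

# Each touched index contributes at most d + 1 = log2(N) + 1 entries of f,
# so a lower bound on |Q| turns into a lower bound on |S|.
bound = len(S) * (d + 1)
```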

Now, we can obtain Theorem 2 as a direct corollary of the lower bound for zero-sum games.
We remark that it is possible to prove a tighter lower bound than the one we prove here, via direct 
information-theoretic arguments; we defer details to Appendix A. 

Theorem 2 (restated). Any (randomized) algorithm in the optimizable experts model cannot
guarantee an expected average regret smaller than over at least T > 20 rounds in total time better
than $O(\sqrt{N}/\log^3 N)$.

Proof. Suppose that there exists an online algorithm in the optimizable experts model that
guarantees expected average regret < ^ in total time $\tau$, where $\tau = o(\sqrt{N}/\log^3 N)$. Following the line
of arguments presented in Section 4, we can show that the algorithm can be used to approximate
zero-sum games. First, as explained in Section 4, we may assume that the algorithm is self-oblivious
(see that section for the definition), and for such an algorithm Lemma 4.1 of Cesa-Bianchi and
Lugosi (2006) shows that with probability at least $1 - \delta$ = |, the average regret after T > 20 rounds
is at most -|- $\sqrt{\log(1/\delta)/2T}$ < |. Then, following the proof of Theorem 3 we can show that the
online algorithm, if deployed by two players in a zero-sum game, can be used to approximate the
equilibrium of the game to within ^ with probability at least | and, with access to best response
oracles, in total runtime $O(\tau)$. This is a contradiction to the statement of Theorem 4, proving our
claim. □

5.1 Online Lipschitz-Continuous Optimization 

In this section we present a simple consequence of Theorem 2: we show that any reduction from
online Lipschitz-continuous optimization in $d$-dimensional Euclidean space to the corresponding
offline problem must run in time exponential in the dimension $d$. Notice that we do not assume
convexity of the loss functions: for convex (and Lipschitz) functions it is well known that
dimension-free regret bounds are possible (Zinkevich, 2003).

The online optimization model with oracles is very similar to the model we presented in
Section 2. The only difference is that in online optimization, the decision set $\mathcal{X}$ is a compact subset
of a $d$-dimensional Euclidean space. The main result of this section shows that even when the
functions $f_1, f_2, \ldots$ are all 1-Lipschitz, the runtime required for convergence of the average regret
in this model is exponential in the dimension $d$.

Corollary 18. For any (randomized) algorithm in the oracle-based online optimization model,
there are oracles Val and Opt and a sequence $f_1, f_2, \ldots : [0,1]^d \to [0,1]$ of 1-Lipschitz loss functions
in $d$ dimensions such that the runtime required for the algorithm to attain expected average regret
smaller than ^ is $\Omega(2^{d/2}/\mathrm{poly}(d))$.

We prove the corollary via reduction from Theorem 2. The idea is to embed a discrete online
problem over $N$ experts in $d = \lceil \log N \rceil$ dimensions. To this end, we view functions over the
hypercube $\{0,1\}^d$ as functions over the set of experts $\mathcal{X} = [N]$ by identifying expert $i$ with the
vertex corresponding to $i$’s binary representation using $d$ bits. Given a function $f : [N] \to [0,1]$,
we define a function $\tilde{f}$ over $\mathcal{K} = [0,1]^d$ by

$$\forall\, x \in [0,1]^d, \qquad \tilde{f}(x) = \frac{1}{\sqrt{d}} \sum_{\xi \in \{0,1\}^d} f(\xi) \cdot \prod_{i : \xi_i = 1} x_i \prod_{i : \xi_i = 0} (1 - x_i) .$$

It is not hard to show that $\tilde{f}$ is Lipschitz-continuous over $[0,1]^d$.

Lemma 19. The function $\tilde{f}$ is 1-Lipschitz over $[0,1]^d$ (with respect to the Euclidean norm).
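
Proof. Notice that $\tilde{f}$ is linear in $x_i$, thus for all $x \in [0,1]^d$,

$$|\partial_i \tilde{f}(x)| = \left| \frac{\partial \tilde{f}}{\partial x_i}(x_1, \ldots, x_d) \right| = \left| \tilde{f}(x_1, \ldots, x_{i-1}, 1, x_{i+1}, \ldots, x_d) - \tilde{f}(x_1, \ldots, x_{i-1}, 0, x_{i+1}, \ldots, x_d) \right| \le \frac{1}{\sqrt{d}} .$$

Hence, $\|\nabla \tilde{f}(x)\|_2^2 = \sum_{i=1}^{d} |\partial_i \tilde{f}(x)|^2 \le 1$, which implies that $\tilde{f}$ is 1-Lipschitz. □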


The following lemma shows that an optimization oracle over $[N]$ can be directly converted into
an oracle for the extensions over the convex set $[0,1]^d$.

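Lemma 20. For any functions $f_1, \ldots, f_m : \{0,1\}^d \to [0,1]$ and scalars $\alpha_1, \ldots, \alpha_m \ge 0$ we have

$$\min_{x \in [0,1]^d} \sum_{i=1}^{m} \alpha_i \tilde{f}_i(x) = \frac{1}{\sqrt{d}} \min_{x \in \{0,1\}^d} \sum_{i=1}^{m} \alpha_i f_i(x) .$$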


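Proof. Denote $f = \sum_{i=1}^{m} \alpha_i f_i$; notice that $\tilde{f} = \sum_{i=1}^{m} \alpha_i \tilde{f}_i$. Then, the lemma claims that

$$\min_{x \in [0,1]^d} \tilde{f}(x) = \frac{1}{\sqrt{d}} \min_{x \in \{0,1\}^d} f(x) .$$

This follows directly from the definition of $\tilde{f}$, as $\tilde{f}(x)$ is (up to the factor $1/\sqrt{d}$) a convex combination
of the values of $f$ on the vertices of the hypercube, thus the minimum of $\tilde{f}$ must be attained at one of the vertices. □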
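The extension and the vertex property used in Lemma 20 can be checked numerically for a small $d$. The sketch below is illustrative: `extension` is a hypothetical helper that evaluates the scaled multilinear extension by brute force over all $2^d$ vertices, and the test function `f` is an arbitrary example with values in $[0,1]$.

```python
import math
from itertools import product


def extension(f, x):
    """Evaluate the scaled multilinear extension of f at a point x in [0,1]^d.

    f maps 0/1 tuples of length d to [0,1]; the sum runs over all 2**d vertices,
    so this is only meant for small d.
    """
    d = len(x)
    total = 0.0
    for v in product((0, 1), repeat=d):
        weight = 1.0
        for xi, vi in zip(x, v):
            # Factor x_i for coordinates set to 1, (1 - x_i) for coordinates set to 0.
            weight *= xi if vi == 1 else (1.0 - xi)
        total += f(v) * weight
    return total / math.sqrt(d)


def f(v):
    # Hypothetical example on {0,1}^3 with values in [0,1].
    return 1.0 if v[0] != v[1] else 0.25 * v[2]


at_vertex = extension(f, (1, 0, 1))          # a vertex of the cube
at_center = extension(f, (0.5, 0.5, 0.5))    # uniform average over all vertices
vertex_min = min(f(v) for v in product((0, 1), repeat=3)) / math.sqrt(3)
```

At a vertex the weights collapse to a single term, which is why the extension agrees with $f/\sqrt{d}$ there and cannot go below the smallest vertex value anywhere in the cube.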

Lemma 21. Let $f_1, \ldots, f_T : \{0,1\}^d \to [0,1]$ be a sequence of functions and let $\tilde{f}_1, \ldots, \tilde{f}_T : [0,1]^d \to
[0,1]$ be the corresponding Lipschitz extensions. Given an algorithm that achieves regret $R_T$ on
$\tilde{f}_1, \ldots, \tilde{f}_T$ over the decision set $[0,1]^d$, one can efficiently achieve expected regret of $\sqrt{d}\, R_T$ on
$f_1, \ldots, f_T$ over the decision set $\{0,1\}^d$.

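Proof. Assume that the algorithm produced $x_1, \ldots, x_T \in [0,1]^d$ such that

$$\sum_{t=1}^{T} \tilde{f}_t(x_t) - \min_{x^* \in [0,1]^d} \sum_{t=1}^{T} \tilde{f}_t(x^*) \le R_T .$$

Noticing, by Lemma 20 above, that

$$\min_{x^* \in [0,1]^d} \sum_{t=1}^{T} \tilde{f}_t(x^*) = \frac{1}{\sqrt{d}} \min_{y^* \in \{0,1\}^d} \sum_{t=1}^{T} f_t(y^*) ,$$

and using randomized rounding to obtain points $y_1, \ldots, y_T \in \{0,1\}^d$ such that $E[f_t(y_t)] = \sqrt{d}\, \tilde{f}_t(x_t)$
(that is, by choosing $y_t(i) = 1$ with probability $x_t(i)$ for each $i$ independently), we get

$$E\left[ \sum_{t=1}^{T} f_t(y_t) \right] - \min_{y^* \in \{0,1\}^d} \sum_{t=1}^{T} f_t(y^*) \le \sqrt{d}\, R_T .$$

Hence, the sequence of actions $y_1, \ldots, y_T \in \{0,1\}^d$ achieves regret of $\sqrt{d}\, R_T$ in expectation with
respect to the functions $f_1, \ldots, f_T$. □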
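The rounding identity $E[f(y)] = \sqrt{d}\, \tilde{f}(x)$ used in the proof can be checked by simulation. This is a sketch under assumptions: the helper names, the example function, the point `x`, the seed, and the sample size are all illustrative choices.

```python
import math
import random
from itertools import product


def extension(f, x):
    """Scaled multilinear extension of f at x, by brute force over all vertices."""
    d = len(x)
    total = 0.0
    for v in product((0, 1), repeat=d):
        weight = 1.0
        for xi, vi in zip(x, v):
            weight *= xi if vi == 1 else (1.0 - xi)
        total += f(v) * weight
    return total / math.sqrt(d)


def round_point(x, rng):
    """Randomized rounding: set coordinate i to 1 with probability x[i], independently."""
    return tuple(1 if rng.random() < xi else 0 for xi in x)


def f(v):
    # Hypothetical example on {0,1}^3 with values in [0,1].
    return 1.0 if v[0] != v[1] else 0.25 * v[2]


x = (0.2, 0.7, 0.5)
exact = math.sqrt(len(x)) * extension(f, x)   # sqrt(d) * extension = E[f(y)]

rng = random.Random(0)                         # fixed seed for reproducibility
samples = 20000
estimate = sum(f(round_point(x, rng)) for _ in range(samples)) / samples
```

The probability that rounding lands on a vertex $\xi$ is exactly the product weight appearing in the definition of $\tilde{f}$, which is what makes the identity hold without any approximation.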


Corollary 18 now follows directly from Lemma 21 and Theorem 2. 

A Tighter Lower Bound for Optimizable Experts


Here we prove a slightly tighter version of Theorem 2 than the one we proved in Section 5. 


Theorem 2 (restated). Let $N > 0$ and fix $\mathcal{X} = \mathcal{Y} = [N]$. For any (randomized) regret minimization
algorithm, there is a loss function $\ell : \mathcal{X} \times \mathcal{Y} \to [0,1]$, corresponding oracles Val, Opt, and a sequence
of actions $y_1, y_2, \ldots \in \mathcal{Y}$ such that the runtime required for the algorithm to attain expected average
regret smaller than $\frac{1}{2}$ is at least $\Omega(\sqrt{N})$.

The proof proceeds by demonstrating a randomized construction of a hard online learning
problem with optimizable experts, which we now describe. For simplicity, we assume that $N = n^2$
for some integer $n > 1$. Pick a set $X^* = \{x_1^*, \ldots, x_n^*\} \subseteq [N]$ of $n$ “good” experts, by choosing
$x_i^* \in X_i = \{n(i-1)+1, \ldots, ni\}$ uniformly at random for each $i = 1, \ldots, n$. Then, define the loss
function:

$$\forall\, x, y \in [N], \qquad \ell(x, y) = \begin{cases} 0 & \text{if } x, y \in X^* \text{ and } x \ge y, \\ 1 & \text{otherwise.} \end{cases}$$

Theorem 2 is obtained as a direct corollary of the following result. 

Theorem 22. Consider an adversary that picks the sequence $y_1, y_2, \ldots, y_n$, where $y_t = x_t^*$ for
all $t$. Any online algorithm whose expected runtime is $o(\sqrt{N})$ cannot attain expected average regret
smaller than $\frac{1}{2}$ (at some point $1 \le t \le n$ during the game) on the sequence $y_1, \ldots, y_n$.

For the proof, we make a simple observation given in the following lemma. 

Lemma 23. For any distribution $p \in \Delta_N$, there exists an $x^* \in \mathrm{supp}(p)$ which is a valid answer to
the query $\mathrm{Opt}(p)$, namely, such that $x^*$ is a row index of some best expert with respect to $p$.

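Proof. A valid optimization oracle for the loss function $\ell$ defined above is given by

$$\forall\, p \in \Delta_N, \qquad \mathrm{Opt}(p) = \begin{cases} \max\{\mathrm{supp}(p) \cap X^*\} & \text{if } \mathrm{supp}(p) \cap X^* \ne \emptyset, \\ \max\{\mathrm{supp}(p)\} & \text{otherwise.} \end{cases}$$

It is now seen that for this oracle, it holds that $\mathrm{Opt}(p) \in \mathrm{supp}(p)$ for any $p \in \Delta_N$. □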
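The oracle in the proof can be compared against a brute-force minimizer on a small random instance. The sketch below is illustrative: the function names are hypothetical, distributions are represented as dictionaries from indices in $[N]$ to positive `Fraction` weights (to avoid floating-point ties), and the loss is taken to be 0 exactly when both indices are good and $x \ge y$.

```python
import random
from fractions import Fraction


def make_instance(n, rng):
    """Pick one good expert uniformly at random from each block X_i of size n (N = n*n)."""
    return {rng.randint(n * (i - 1) + 1, n * i) for i in range(1, n + 1)}


def loss(good, x, y):
    """Loss is 0 iff both x and y are good experts and x >= y, and 1 otherwise."""
    return 0 if (x in good and y in good and x >= y) else 1


def opt(good, p):
    """Oracle from the proof: the largest good index in supp(p), else the largest index."""
    hits = [x for x in p if x in good]
    return max(hits) if hits else max(p)


def expected_loss(good, x, p):
    """Loss of expert x averaged over the distribution p on the adversary's actions."""
    return sum(w * loss(good, x, y) for y, w in p.items())


n = 4
N = n * n
rng = random.Random(0)
good = make_instance(n, rng)

p = {3: Fraction(1, 2), 7: Fraction(1, 4), 12: Fraction(1, 4)}
answer = opt(good, p)
best = min(expected_loss(good, x, p) for x in range(1, N + 1))
```

The oracle never looks outside the support of $p$, yet its answer is as good as the global minimizer over all $N$ experts.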

Before proving Theorem 22, we state and prove a lemma which is key to our analysis. 


Lemma 24. Let $A$ be an array of size $n$ formed by choosing an entry uniformly at random and
setting its value to $x \ne 0$, while keeping all other entries set to zero. Any algorithm that reads at
most $n/2$ entries of $A$ in the worst case and no more than $m$ in expectation, cannot determine the
index of $x$ with probability greater than $2m/n$.


Proof. It is enough to prove the lemma for deterministic algorithms, as any randomized algorithm
can be seen as a distribution over deterministic algorithms. Fix some deterministic algorithm and
denote the number of entries of $A$ it reads by the random variable $M$. We assume that $M \le n' = n/2$
with probability one, and $E[M] \le m$. For each $t = 1, \ldots, n'$, let $I_t$ be an indicator for the event that
the $t$’th query of the algorithm is successful (we may assume that the entire sequence of queries
of an algorithm is defined even when it terminates before actually completing it). We can assume
without loss of generality that the algorithm does not access an entry more than once (so that
only one of its queries can be successful). Then, the algorithm’s probability of success is given by
$E\left[\sum_{t=1}^{M} I_t\right]$.

Now, denote by $\{F_t\}_{t=0}^{n'}$ the filtration generated by the algorithm’s observations up to and
including time $n'$ (with $F_0 = \emptyset$), and observe that for all $1 \le t \le n'$ we have
$E[I_t \mid F_{t-1}] \le \frac{1}{n-t+1} \le \frac{2}{n}$:
if the algorithm was successful in its first $t-1$ queries then certainly $I_t = 0$; otherwise, the
conditional expectation equals $\frac{1}{n-t+1}$, as the non-zero value $x$ has the same probability of being in
any of the $n-t+1$ remaining entries (given any previous observations made by the algorithm).

Define a sequence of random variables according to $Z_t = \sum_{s=1}^{t} \left( I_s - E[I_s \mid F_{s-1}] \right)$ for $t = 1, \ldots, n'$
(with $Z_0 = 0$), and notice that $Z_0, \ldots, Z_{n'}$ is a martingale with respect to $\{F_t\}$, as $Z_t$ is measurable with respect
to $F_t$ and a simple computation shows that $E[Z_t \mid F_{t-1}] = Z_{t-1}$. Also, observe that by definition
$M$ is a stopping time with respect to $\{F_t\}$, since the algorithm can only choose to stop based on
its past observations. Hence, Doob’s optional stopping time theorem (see, e.g., Levin et al., 2009)
shows that $E[Z_M] = E[Z_0] = 0$. This implies that

$$P(\text{success}) = E\left[ \sum_{t=1}^{M} I_t \right] = E\left[ \sum_{t=1}^{M} E[I_t \mid F_{t-1}] \right] \le E\left[ M \cdot \frac{2}{n} \right] \le \frac{2m}{n} ,$$

which completes the proof. □

We can now prove Theorem 22.

Proof of Theorem 22. In what follows, we say that an algorithm touches row index i if the algorithm
calls, at some point throughout its execution, the oracle Val with row index i as input. We say 
that the algorithm touches column index i if it invokes the oracle Opt with a distribution which is 
supported on i. Finally, we say that an algorithm touches index i if it touches either the row index 
i or the column index i. To lower bound the runtime of a given online algorithm, it is therefore 
enough to lower bound the total number of distinct indices it touches. 

We first observe that any algorithm that touches m distinct indices can be implemented without 
invoking the oracle Opt at all, such that the total number of row indices it touches is at most m. In 
other words, we can implement Opt using the oracle Val such that the total number of row indices 
touched by the resulting algorithm is no more than m. To see this, recall Lemma 23 that asserts 
that for any distribution p over columns, one of the indices in supp(p) must be a valid answer to the 
query Opt(p). Thus, to compute Opt(p) it is enough to simply read the entire rows whose indices 
are in supp(p) using repeated queries to Val, and manually compute the best performing expert 
over p. Notice that the total number of distinct row indices touched by this implementation is 
indeed no more than the total number of different indices in the supports of all input distributions 
to Opt, which is at most m. 
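The simulation of Opt by Val described above can be sketched as follows. This is a minimal illustration, assuming the loss is 0 exactly when both indices are good and the row index is at least the column index; `ValOracle`, `opt_via_val`, and the instance with good experts {2, 5, 9} are hypothetical.

```python
from fractions import Fraction


class ValOracle:
    """Value oracle for the loss table that records which row indices it was asked about."""

    def __init__(self, good):
        self.good = set(good)
        self.touched_rows = set()

    def __call__(self, x, y):
        self.touched_rows.add(x)
        return 0 if (x in self.good and y in self.good and x >= y) else 1


def opt_via_val(val, p):
    """Answer Opt(p) using only Val queries on rows indexed by supp(p).

    Some index in supp(p) is always a valid answer, so it suffices to compare
    the experts in the support against each other.
    """
    best, best_loss = None, None
    for x in sorted(p):
        candidate_loss = sum(w * val(x, y) for y, w in p.items())
        if best_loss is None or candidate_loss < best_loss:
            best, best_loss = x, candidate_loss
    return best


# Hypothetical instance with N = 9 (n = 3): one good expert per block of size 3.
good = {2, 5, 9}
val = ValOracle(good)
p = {1: Fraction(1, 4), 2: Fraction(1, 4), 5: Fraction(1, 2)}
answer = opt_via_val(val, p)
```

The set `val.touched_rows` ends up equal to the support of `p`, matching the claim that the simulation touches no row outside the supports of the distributions passed to Opt.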

Hence, up to multiplicative constants in our bounds, we may restrict our attention to algorithms
that do not use the optimization oracle Opt at all, and lower bound the number of distinct row
indices they touch. Consider such an algorithm that attains average regret $< 1$ on some round
$t \le n$ with probability at least $\frac{1}{2}$ on the randomized construction we described. Notice that this
property is essential for the expected average regret to be $< \frac{1}{2}$ due to Markov’s inequality, so it is
enough to focus exclusively on such algorithms. We will show that the expected number of distinct
row indices the algorithm touches, and hence its expected runtime, is at least $\Omega(\sqrt{N})$.

Denote by $m$ the expected total number of distinct row indices touched, and for all $i = 1, \ldots, n$,
let $m_i$ be the expected number of distinct row indices from the set $X_i$ the algorithm touches.
Then, we have $m \ge \sum_{i=1}^{n} m_i$, as $X_1, \ldots, X_n$ is a partition of $[N]$. For all $i = 1, \ldots, n$, let $p_i$ be
the probability that the algorithm picked expert $x_i^*$ on one of the rounds $1, \ldots, i$ of the game.
Since detecting one of the good experts on time is necessary for obtaining a sublinear regret, the
algorithm’s probability of attaining an average regret $< 1$ is upper bounded by $\sum_{i=1}^{n} p_i$. This means
that $\sum_{i=1}^{n} p_i \ge \frac{1}{2}$, as we assume that the algorithm succeeds with probability at least $\frac{1}{2}$.

On the other hand, observe that Lemma 24 implies $p_i \le 2m_i/n$ for each $i$, as any algorithm
that makes no more than $m_i$ queries in expectation (and no more than $n/2$ in the worst case) to
experts in the range $X_i$ in the table of losses, cannot detect expert $x_i^*$ (that was chosen uniformly
at random from this range) before round $i$ with probability higher than $2m_i/n$; notice that queries
to other ranges in the table are irrelevant to this probability, since these ranges are constructed
independently of $X_i$. Hence, we obtain $\frac{1}{2} \le \sum_{i=1}^{n} p_i \le \frac{2}{n} \sum_{i=1}^{n} m_i \le \frac{2m}{n}$, from which we conclude
that $m \ge \frac{1}{4} n = \frac{1}{4} \sqrt{N}$. This concludes the proof. □


B Lower Bound for Online Binary Classification 

In this section we extend our results to the setting of online binary classification. In this setting,
the actions of the adversary are pairs $(x, y)$ of a feature vector $x$ and a binary label $y \in \{0,1\}$. The
loss function $\ell$ then has additional structure: the loss of any expert (or hypothesis, in the context
of classification) over the example $(x, y)$ must be opposite to the loss of the same expert over the
example $(x, 1-y)$. The optimization oracle Opt in this case receives a distribution over examples
and emits the corresponding empirical risk minimizer, that is, the hypothesis having the minimal loss with
respect to the input distribution.

Since online binary classification is a special case of the optimizable experts setting (we merely
impose additional constraints on the loss function), our algorithms and runtime upper bounds
directly transfer to this specific case. However, the lower bounds do not directly apply: our
constructions of loss functions for the proofs of the lower bounds (in both Section 5 and Appendix A)
do not necessarily admit the additional structure required by a loss function in the binary
classification setting. Nevertheless, below we show that the construction given in Appendix A can be
adapted to binary classification, and thereby reprove the $\Omega(\sqrt{N})$ runtime lower bound in the latter
setting.

First, let us define the setting more formally. In online binary classification, there is a finite
set $\mathcal{H}$ of $N$ hypotheses, a set $\mathcal{X}$ of feature vectors, and a loss function $\ell : \mathcal{H} \times (\mathcal{X} \times \{0,1\}) \to [0,1]$ that
assigns losses to all pairs of hypothesis $h \in \mathcal{H}$ and labeled example $(x, y) \in \mathcal{X} \times \{0,1\}$. First,
an adversary privately chooses an arbitrary sequence $(x_1, y_1), \ldots, (x_T, y_T) \in \mathcal{X} \times \{0,1\}$ of labeled
examples. Then, on each round $t = 1, \ldots, T$, the player receives the feature vector $x_t$ and has to
pick a hypothesis $h_t \in \mathcal{H}$, possibly at random; subsequently, the player suffers the loss $\ell(h_t; x_t, y_t)$
and observes the label $y_t$. The goal of the player is to minimize the running time required to
achieve $\epsilon$ expected average regret, namely to reach

$$E\left[ \frac{1}{T} \sum_{t=1}^{T} \ell(h_t; x_t, y_t) \right] - \min_{h \in \mathcal{H}} \frac{1}{T} \sum_{t=1}^{T} \ell(h; x_t, y_t) \le \epsilon .$$


The oracles Val and Opt are defined exactly as before: the value oracle satisfies $\mathrm{Val}(h; x, y) =
\ell(h; x, y)$ for all $h \in \mathcal{H}$ and $(x, y) \in \mathcal{X} \times \{0,1\}$; the optimization oracle accepts a distribution
$p \in \Delta(\mathcal{X} \times \{0,1\})$ and returns the hypothesis $h \in \mathcal{H}$ that minimizes $\sum_{(x,y)} p(x,y)\, \ell(h; x, y)$.

In the (optimizable) online binary classification model, we prove: 

Theorem 25. Let $N > 0$ and fix $\mathcal{H} = \mathcal{X} = [N]$. For any (randomized) regret minimization
algorithm, there is a loss function $\ell : \mathcal{H} \times (\mathcal{X} \times \{0,1\}) \to [0,1]$, corresponding oracles Val, Opt, and
a sequence of labeled examples $(x_1, y_1), (x_2, y_2), \ldots \in \mathcal{X} \times \{0,1\}$ such that the runtime required for
the algorithm to attain expected average regret smaller than $\frac{1}{2}$ is at least $\Omega(\sqrt{N})$.

In order to prove the theorem, we adapt the construction of Appendix A as follows. Assume
that $N = n^2$ for some integer $n > 1$, and let $\mathcal{H} = [N]$ be the hypothesis class and $\mathcal{X} = [N]$ be
the set of possible feature vectors. Pick a set $H^* = \{h_1^*, \ldots, h_n^*\} \subseteq \mathcal{H}$ of $n$ “good” hypotheses, by
choosing $h_i^* \in H_i = \{n(i-1)+1, \ldots, ni\}$ uniformly at random for each $i = 1, \ldots, n$. Also, for
each feature vector $x \in \mathcal{X}$ choose a “good” label $y^*(x) \in \{0,1\}$ uniformly at random. Then, define
losses for all pairs of hypothesis $h \in \mathcal{H}$ and example $(x, y) \in \mathcal{X} \times \{0,1\}$ via:

$$\ell(h; x, y) = \begin{cases} \ell(h, x) & \text{if } y = y^*(x), \\ 1 - \ell(h, x) & \text{if } y \ne y^*(x), \end{cases}$$

where $\ell(h, x)$ is the loss function constructed in Appendix A, namely:

$$\ell(h, x) = \begin{cases} 0 & \text{if } h, x \in H^* \text{ and } h \ge x, \\ 1 & \text{otherwise.} \end{cases}$$

For this construction we can prove the next theorem, from which Theorem 25 immediately follows. 

^The hypothesis of choice $h_t$ is typically used to classify $x_t$ via a classification rule $\phi$, and the incurred loss is then
a function of the classification $\phi(h_t, x_t)$ and the true label $y_t$. This is equivalent to our definition: any binary loss
function $\ell(h; x, y)$ can be equivalently written as $\ell(\phi(h, x), y)$ with a suitable $\phi : \mathcal{H} \times \mathcal{X} \to \{0,1\}$.

Theorem 26. Consider an adversary that picks the sequence $(x_1, y_1), \ldots, (x_n, y_n)$ of examples,
where $x_t = h_t^*$ and $y_t = y^*(x_t)$ for all $t = 1, \ldots, n$. Any online algorithm whose expected runtime
is $o(\sqrt{N})$ cannot attain expected average regret smaller than $\frac{1}{2}$ (at some point $1 \le t \le n$ during the
game) on the sequence $(x_1, y_1), \ldots, (x_n, y_n)$.

The proof of Theorem 26 is very similar to the proof of Theorem 22: the only difference is in 
Lemma 23 that no longer applies for the new construction. However, we can prove the following 
analogue of that lemma in our current setup. 

Lemma 27. For any distribution $p \in \Delta(\mathcal{X} \times \{0,1\})$ over examples, one of the following statements
must hold true: (i) there exists $(x^*, y^*) \in \mathrm{supp}(p)$ such that $h = x^*$ is a valid answer to $\mathrm{Opt}(p)$; (ii)
any $h \in \mathcal{H} \setminus H^*$ is a valid answer to the query $\mathrm{Opt}(p)$.

Intuitively, the lemma tells us that for the loss function $\ell$ we constructed, the optimization
oracle is completely redundant, as the output of a query $\mathrm{Opt}(p)$ can be implemented via a manual
search over the support of $p$. In other words, the optimization oracle does not reduce the number
of hypotheses we would have to inspect for minimizing the regret.

Proof. Let $S = \{x : (x, y) \in \mathrm{supp}(p)\}$. Notice that if $S \cap H^* = \emptyset$, i.e., $p$ does not hit any of the good
feature vectors (that correspond to the good hypotheses), then any hypothesis is a valid answer to
$\mathrm{Opt}(p)$, since for any $x \in S$ and $y \in \{0,1\}$ it holds that $\ell(h; x, y) = \ell(h'; x, y)$ for all $h, h' \in \mathcal{H}$.
In particular, any $h \in S$ is a valid answer to $\mathrm{Opt}(p)$ in this case. In all other cases, it is enough
to consider only the elements in the intersection $S' = S \cap H^*$, since atoms $(x, y) \in \mathrm{supp}(p)$ with
$x \notin H^*$ contribute to the losses of all hypotheses equally and do not affect the optimization for the
best hypothesis with respect to $p$.

For all $1 \le i \le n$ let $p_i = p(x_i^*, y^*(x_i^*))$ and $p_i' = p(x_i^*, 1 - y^*(x_i^*))$. Also, for notational
convenience, let $x_0^*$ denote an arbitrary hypothesis from $\mathcal{H} \setminus H^*$. Then, inspecting the structure
of $\ell$, it follows that $x_{i^*}^*$ is a valid answer to $\mathrm{Opt}(p)$, where

$$i^* = \arg\min_{0 \le i \le n} \left\{ p_1' + \cdots + p_i' + p_{i+1} + \cdots + p_n \right\} ,$$

and in case there are multiple minimizers, Opt chooses the one with smallest $i^*$. Consider two cases:
$i^* = 0$ and $i^* \ge 1$. In the first case, $x_0^*$ is a valid answer to $\mathrm{Opt}(p)$, and the lemma’s claim holds
true since $x_0^*$ can be any hypothesis from $\mathcal{H} \setminus H^*$. In the second case, we claim that it must be the
case that $p_{i^*} > 0$: otherwise, we would have

$$p_1' + \cdots + p_{i^*-1}' + p_{i^*}' + p_{i^*+1} + \cdots + p_n \ge p_1' + \cdots + p_{i^*-1}' + p_{i^*} + p_{i^*+1} + \cdots + p_n ,$$

which contradicts the optimality and minimality of $i^*$. This means that $x_{i^*}^* \in S' \subseteq S$, and $x_{i^*}^*$ is a
valid response to $\mathrm{Opt}(p)$, which gives the lemma. □
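The dichotomy in Lemma 27 can be verified by brute force on small instances. The sketch below is illustrative only: all names are hypothetical, distributions are given as unnormalized positive integer weights on labeled examples (normalization does not change the minimizers), and the base loss is taken to be 0 exactly when both indices are good and $h \ge x$.

```python
import random


def base_loss(good, h, x):
    """Loss from Appendix A: 0 iff h and x are both good and h >= x."""
    return 0 if (h in good and x in good and h >= x) else 1


def cls_loss(good, y_star, h, x, y):
    """Classification loss: the base loss on the good label, its complement otherwise."""
    value = base_loss(good, h, x)
    return value if y == y_star[x] else 1 - value


def minimizers(good, y_star, weights, N):
    """All hypotheses minimizing the weighted loss; weights maps (x, y) to positive ints."""
    totals = {
        h: sum(w * cls_loss(good, y_star, h, x, y) for (x, y), w in weights.items())
        for h in range(1, N + 1)
    }
    best = min(totals.values())
    return {h for h, t in totals.items() if t == best}


def lemma_holds(good, y_star, weights, N):
    """Check that (i) some feature vector in the support is optimal, or (ii) every bad hypothesis is."""
    mins = minimizers(good, y_star, weights, N)
    support_features = {x for (x, _) in weights}
    bad = set(range(1, N + 1)) - good
    return bool(mins & support_features) or bad <= mins


def random_instance(n, rng):
    N = n * n
    good = {rng.randint(n * (i - 1) + 1, n * i) for i in range(1, n + 1)}
    y_star = {x: rng.randint(0, 1) for x in range(1, N + 1)}
    return N, good, y_star
```

A usage pattern is to draw a random instance with `random_instance`, sample a few weighted examples, and assert `lemma_holds`; the check costs time linear in $N$ per distribution, so it is a verification aid and not an efficient oracle.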

Theorem 26 now follows via the same arguments we used for proving Theorem 22, using 
Lemma 27 in place of Lemma 23. 